Abstract. We propose a pivotal method for estimating high-dimensional sparse linear regression models, where the overall number of regressors $p$ is large, possibly much larger than $n$, but only $s$ regressors are significant. The method is a modification of the lasso, called the square-root lasso. The method is pivotal in that it neither relies on knowledge of the standard deviation $\sigma$ nor needs to pre-estimate $\sigma$. Moreover, the method does not rely on normality or sub-Gaussianity of the noise. It achieves near-oracle performance, attaining the convergence rate $\sigma\{(s/n)\log p\}^{1/2}$ in the prediction norm, and thus matching the performance of the lasso with known $\sigma$. These performance results are valid for both Gaussian and non-Gaussian errors, under some mild moment restrictions. We formulate the square-root lasso as a solution to a convex conic programming problem, which allows us to implement the estimator using efficient algorithmic methods, such as interior-point and first-order methods.



1. Introduction 

We consider the linear regression model for outcome $y_i$ given fixed $p$-dimensional regressors $x_i$:

$$y_i = x_i'\beta_0 + \sigma\epsilon_i \quad (i = 1, \ldots, n), \tag{1}$$

with independent and identically distributed noise $\epsilon_i$ $(i = 1, \ldots, n)$ having law $F_0$ such that

$$E_{F_0}(\epsilon_i) = 0, \quad E_{F_0}(\epsilon_i^2) = 1. \tag{2}$$

The vector $\beta_0 \in \mathbb{R}^p$ is the unknown true parameter value, and $\sigma > 0$ is the unknown standard deviation. The regressors $x_i$ are $p$-dimensional, $x_i = (x_{ij}, j = 1, \ldots, p)'$, where the dimension $p$ is possibly much larger than the sample size $n$. Accordingly, the true parameter value $\beta_0$ lies in a very high-dimensional space $\mathbb{R}^p$. However, the key assumption that makes the estimation possible is the sparsity of $\beta_0$:

$$T = \mathrm{supp}(\beta_0) \text{ has } s < n \text{ elements}. \tag{3}$$

The identity $T$ of the significant regressors is unknown. Throughout, without loss of generality, we normalize

$$\frac{1}{n}\sum_{i=1}^n x_{ij}^2 = 1 \quad (j = 1, \ldots, p). \tag{4}$$



In making asymptotic statements below we allow for $s \to \infty$ and $p \to \infty$ as $n \to \infty$.

The ordinary least squares estimator is not consistent for estimating $\beta_0$ in the setting with $p > n$. The lasso estimator [23] can restore consistency under mild conditions by penalizing through the sum of absolute parameter values:

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \hat Q(\beta) + \frac{\lambda}{n}\|\beta\|_1, \tag{5}$$

where $\hat Q(\beta) = n^{-1}\sum_{i=1}^n (y_i - x_i'\beta)^2$ and $\|\beta\|_1 = \sum_{j=1}^p |\beta_j|$. The lasso estimator is computationally attractive because it minimizes a structured convex function. Moreover, when errors are normal, $F_0 = N(0, 1)$, and suitable design conditions hold, if one uses the penalty level

$$\lambda = \sigma c\, 2 n^{1/2}\Phi^{-1}(1 - \alpha/2p) \tag{6}$$

for some constant $c > 1$, this estimator achieves the near-oracle performance, namely

$$\|\hat\beta - \beta_0\|_2 \lesssim \sigma\{s\log(2p/\alpha)/n\}^{1/2}, \tag{7}$$

with probability at least $1 - \alpha$. Remarkably, in (7) the overall number of regressors $p$ shows up only through a logarithmic factor, so that if $p$ is polynomial in $n$, the oracle rate is achieved up to a factor of $\log^{1/2} n$. Recall that the oracle knows the identity $T$ of significant regressors, and so it can achieve the rate $\sigma(s/n)^{1/2}$. Result (7) was demonstrated by [4], and closely related results were given in [13] and [28]. Other fundamental results obtained for related problems are contained in [6], [25], [10], [5], [30], [8], [27], and [29]; see [4] for further references.

Despite these attractive features, the lasso construction (5)-(6) relies on knowing the standard deviation $\sigma$ of the noise. Estimation of $\sigma$ is non-trivial when $p$ is large, particularly when $p \gg n$, and remains an outstanding practical and theoretical problem. The estimator we propose in this paper, the square-root lasso, eliminates the need to know or to pre-estimate $\sigma$. In addition, by using moderate deviation theory, we can dispense with the normality assumption $F_0 = \Phi$ under certain conditions.

The square-root lasso estimator of $\beta_0$ is defined as the solution to the optimization problem

$$\hat\beta \in \arg\min_{\beta \in \mathbb{R}^p} \{\hat Q(\beta)\}^{1/2} + \frac{\lambda}{n}\|\beta\|_1, \tag{8}$$

with the penalty level

$$\lambda = c n^{1/2}\Phi^{-1}(1 - \alpha/2p), \tag{9}$$

for some constant $c > 1$. The penalty level in (9) is independent of $\sigma$, in contrast to (6), and hence is pivotal with respect to this parameter. Furthermore, under reasonable conditions, the proposed penalty level (9) will also be valid asymptotically without imposing normality $F_0 = \Phi$, by virtue of moderate deviation theory.
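
To make the contrast between (6) and (9) concrete, the following minimal Python sketch evaluates both penalty levels. The function names and the example values $n = 100$, $p = 500$ are illustrative assumptions and not part of the paper; only the two formulas are taken from (6) and (9).

```python
from math import sqrt
from statistics import NormalDist


def sqrt_lasso_penalty(n, p, alpha=0.05, c=1.1):
    """Penalty level (9): c * n^{1/2} * Phi^{-1}(1 - alpha / (2p)). No sigma needed."""
    return c * sqrt(n) * NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))


def lasso_penalty(n, p, sigma, alpha=0.05, c=1.1):
    """Penalty level (6): the same quantity multiplied by 2 * sigma."""
    return 2.0 * sigma * sqrt_lasso_penalty(n, p, alpha, c)


# Illustrative values (assumed): n = 100 observations, p = 500 regressors.
lam_sqrt = sqrt_lasso_penalty(n=100, p=500)
lam_lasso_small_noise = lasso_penalty(n=100, p=500, sigma=0.25)
lam_lasso_large_noise = lasso_penalty(n=100, p=500, sigma=3.0)
```

The square-root lasso penalty is a single number for a given $(n, p, \alpha, c)$, whereas the lasso penalty changes by a factor of 12 between the two assumed noise levels.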

We will show that the square-root lasso estimator achieves the near-oracle rates of convergence under suitable design conditions and suitable conditions on $F_0$ that extend significantly beyond normality:

$$\|\hat\beta - \beta_0\|_2 \lesssim \sigma\{s\log(2p/\alpha)/n\}^{1/2}, \tag{10}$$

with probability approaching $1 - \alpha$. Thus, this estimator matches the near-oracle performance of lasso, even though the noise level $\sigma$ is unknown. This is the main
result of this paper. It is important to emphasize here that this result is not a direct 
consequence of the analogous result for the lasso. Indeed, for a given value of the 
penalty level, the statistical structure of the square-root lasso is different from that 
of the lasso, and so our proofs are also different. 

Importantly, despite taking the square-root of the least squares criterion function, the problem (8) retains global convexity, making the estimator computationally attractive. The second main result of this paper is to formulate the square-root lasso as a solution to a conic programming problem. Conic programming can be seen as linear programming with conic constraints, so it generalizes canonical linear programming with non-negative orthant constraints, and inherits a rich set of theoretical properties and algorithmic methods from linear programming. In our case, the constraints take the form of a second-order cone, leading to a particular, highly tractable, form of conic programming. In turn, this allows us to implement the estimator using efficient algorithmic methods, such as interior-point methods, which provide polynomial-time bounds on computational time [16, 17], and modern first-order methods [14, 15, 11, 1].

In what follows, all true parameter values, such as $\beta_0$, $\sigma$, $F_0$, are implicitly indexed by the sample size $n$, but we omit the index in our notation whenever this does not cause confusion. The regressors $x_i$ $(i = 1, \ldots, n)$ are taken to be fixed throughout. This includes random design as a special case, where we condition on the realized values of the regressors. In making asymptotic statements, we assume that $n \to \infty$ and $p = p_n \to \infty$, and we also allow for $s = s_n \to \infty$. The notation $o(\cdot)$ is defined with respect to $n \to \infty$. We use the notation $(a)_+ = \max(a, 0)$, $a \vee b = \max(a, b)$ and $a \wedge b = \min(a, b)$. The $\ell_2$-norm is denoted by $\|\cdot\|_2$, and the $\ell_\infty$-norm by $\|\cdot\|_\infty$. Given a vector $\delta \in \mathbb{R}^p$ and a set of indices $T \subset \{1, \ldots, p\}$, we denote by $\delta_T$ the vector in which $\delta_{Tj} = \delta_j$ if $j \in T$, $\delta_{Tj} = 0$ if $j \notin T$. We also use $E_n(f) = E_n\{f(z)\} = \sum_{i=1}^n f(z_i)/n$. We use $a \lesssim b$ to denote $a \le cb$ for some constant $c > 0$ that does not depend on $n$.

2. The choice of penalty level 

2.1. The general principle and heuristics. The key quantity determining the choice of the penalty level for square-root lasso is the score, the gradient of $\hat Q^{1/2}$ evaluated at the true parameter value $\beta = \beta_0$:

$$\tilde S = \nabla \hat Q^{1/2}(\beta_0) = \frac{\nabla \hat Q(\beta_0)}{2\{\hat Q(\beta_0)\}^{1/2}} = \frac{E_n(x\sigma\epsilon)}{\{E_n(\sigma^2\epsilon^2)\}^{1/2}} = \frac{E_n(x\epsilon)}{\{E_n(\epsilon^2)\}^{1/2}}.$$

The score $\tilde S$ does not depend on the unknown standard deviation $\sigma$ or the unknown true parameter value $\beta_0$, and therefore is pivotal with respect to $(\beta_0, \sigma)$. Under the classical normality assumption, namely $F_0 = \Phi$, the score is in fact completely pivotal, conditional on $X$. This means that in principle we know the distribution of $\tilde S$ in this case, or at least we can compute it by simulation.
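
The pivotality of the score can be checked numerically. The sketch below is an illustration under assumed settings (a Gaussian design, a hypothetical sparse $\beta_0$, and two arbitrary noise levels); it evaluates the square-root lasso score and the lasso score, both up to sign, on the same noise draw rescaled by two values of $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 200                      # illustrative sizes (assumed)
X = rng.standard_normal((n, p))
beta0 = np.zeros(p)
beta0[:3] = 1.0                     # hypothetical sparse truth
eps = rng.standard_normal(n)        # the same noise draw is reused for every sigma


def sqrt_lasso_score(X, resid):
    """E_n(x r) / {E_n(r^2)}^{1/2} for residuals r = y - X beta0 = sigma * eps."""
    return (X.T @ resid / len(resid)) / np.sqrt(np.mean(resid ** 2))


def lasso_score(X, resid):
    """2 E_n(x r): the score of the least squares criterion, up to sign."""
    return 2.0 * X.T @ resid / len(resid)


scores = {}
for sigma in (0.5, 7.0):
    resid = (X @ beta0 + sigma * eps) - X @ beta0
    scores[sigma] = (sqrt_lasso_score(X, resid), lasso_score(X, resid))

same_sqrt = np.allclose(scores[0.5][0], scores[7.0][0])    # pivotal in sigma
same_lasso = np.allclose(scores[0.5][1], scores[7.0][1])   # scales with sigma
```

In this sketch `same_sqrt` is true while `same_lasso` is false, because the normalization by $\{E_n(\sigma^2\epsilon^2)\}^{1/2}$ cancels $\sigma$ only in the square-root lasso score.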

The score $\tilde S$ summarizes the estimation noise in our problem, and we may set the penalty level $\lambda/n$ to overcome it. For reasons of efficiency, we set $\lambda/n$ at the smallest level that dominates the estimation noise, namely we choose the smallest $\lambda$ such that

$$\lambda \ge c\Lambda, \quad \Lambda = n\|\tilde S\|_\infty, \tag{11}$$

with a high probability, say $1 - \alpha$, where $\Lambda$ is the maximal score scaled by $n$, and $c > 1$ is a theoretical constant of [4] to be stated later. The principle of setting $\lambda$ to dominate the score of the criterion function is motivated by the choice of penalty level for the lasso in [4]. This general principle carries over to other convex problems, including ours, and leads to the optimal, near-oracle, performance of other $\ell_1$-penalized estimators.

In the case of the square-root lasso the maximal score is pivotal, so the penalty level in (11) must also be pivotal. We used the square-root transformation in the square-root lasso formulation (8) precisely to guarantee this pivotality. In contrast, for lasso, the score $S = \nabla\hat Q(\beta_0) = 2\sigma E_n(x\epsilon)$ is obviously non-pivotal, since it depends on $\sigma$. Thus, the penalty level for lasso must be non-pivotal. These theoretical differences translate into obvious practical differences. In the lasso, we need to guess conservative upper bounds $\bar\sigma$ on $\sigma$, or we need to use preliminary estimation of $\sigma$ using a pilot lasso, which uses a conservative upper bound $\bar\sigma$ on $\sigma$. In the square-root lasso, none of these is needed. Finally, the use of the pivotality principle for constructing the penalty level is also fruitful in other problems with pivotal scores, for example, median regression [3].

The rule (11) is not practical, since we do not observe $\Lambda$ directly. However, we can proceed as follows:

1. When we know the distribution of errors exactly, e.g., $F_0 = \Phi$, we propose to set $\lambda$ as $c$ times the $(1 - \alpha)$ quantile of $\Lambda$ given $X$. This choice of the penalty level precisely implements (11), and is easy to compute by simulation.

2. When we do not know $F_0$ exactly, but instead know that $F_0$ is an element of some family $\mathcal{F}$, we can rely on either finite-sample or asymptotic upper bounds on quantiles of $\Lambda$ given $X$. For example, as mentioned in the introduction, under some mild conditions on $\mathcal{F}$, $\lambda = c n^{1/2}\Phi^{-1}(1 - \alpha/2p)$ is a valid asymptotic choice.

What follows below elaborates these approaches. Before describing the details, it is useful to mention some heuristics for the principle (11). These arise from considering the simplest case, where none of the regressors are significant, so that $\beta_0 = 0$. We want our estimator to perform at a near-oracle level in all cases, including this one. Here the oracle estimator is $\tilde\beta = \beta_0 = 0$. We also want $\hat\beta = \beta_0 = 0$ in this case, at least with a high probability $1 - \alpha$. From the subgradient optimality conditions of (8), in order for this to be true we must have $-\tilde S_j + \lambda/n \ge 0$ and $\tilde S_j + \lambda/n \ge 0$ $(j = 1, \ldots, p)$. We can only guarantee this by setting the penalty level $\lambda/n$ such that $\lambda \ge n\max_{1 \le j \le p}|\tilde S_j| = n\|\tilde S\|_\infty$ with probability at least $1 - \alpha$. This is precisely the rule (11), and, as it turns out, this delivers near-oracle performance more generally, when $\beta_0 \neq 0$.
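
The heuristic above can be illustrated numerically. In the sketch below, which assumes a Gaussian design, pure-noise outcomes ($\beta_0 = 0$) and an arbitrary noise scale, the penalty is set to $\lambda = n\|\tilde S\|_\infty$ and the criterion (8) at $\beta = 0$ is compared with its value at randomly drawn alternatives. The sizes, seed and perturbation scales are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 60, 30                          # illustrative sizes (assumed)
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))    # normalization (4): E_n(x_j^2) = 1
y = 2.5 * rng.standard_normal(n)       # beta0 = 0, so y is pure noise


def objective(beta, lam):
    """Square-root lasso criterion (8)."""
    resid = y - X @ beta
    return np.sqrt(np.mean(resid ** 2)) + lam / n * np.abs(beta).sum()


# Maximal score scaled by n, evaluated at beta0 = 0.
Lambda = n * np.abs(X.T @ y / n).max() / np.sqrt(np.mean(y ** 2))
lam = Lambda                           # smallest penalty satisfying lam >= Lambda

f_zero = objective(np.zeros(p), lam)
trial_gaps = [
    objective(scale * rng.standard_normal(p), lam) - f_zero
    for scale in (1e-3, 1e-2, 1e-1, 1.0)
    for _ in range(250)
]
min_gap = min(trial_gaps)              # nonnegative if zero is a minimizer
```

Because the criterion is convex and the subgradient condition holds at zero, no alternative can achieve a smaller value, so `min_gap` is nonnegative up to rounding.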

2.2. The formal choice of penalty level and its properties. In order to describe our choice of $\lambda$ formally, define for $0 < \alpha < 1$

$$\Lambda_F(1 - \alpha \mid X) = (1 - \alpha)\text{-quantile of } \Lambda_F \mid X, \tag{12}$$

$$\Lambda(1 - \alpha) = n^{1/2}\Phi^{-1}(1 - \alpha/2p) \le \{2n\log(2p/\alpha)\}^{1/2}, \tag{13}$$

where $\Lambda_F = n\|E_n(x\xi)\|_\infty/\{E_n(\xi^2)\}^{1/2}$, with independent and identically distributed $\xi_i$ $(i = 1, \ldots, n)$ having law $F$. We can compute (12) by simulation.

In the normal case, $F_0 = \Phi$, $\lambda$ can be either of

$$\begin{aligned} \lambda &= c\Lambda_\Phi(1 - \alpha \mid X), \\ \lambda &= c\Lambda(1 - \alpha) = c n^{1/2}\Phi^{-1}(1 - \alpha/2p), \end{aligned} \tag{14}$$

which we call here the exact and asymptotic options, respectively. The parameter $1 - \alpha$ is a confidence level which guarantees near-oracle performance with probability $1 - \alpha$; we recommend $1 - \alpha = 0.95$. The constant $c > 1$ is a theoretical constant of [4], which is needed to guarantee a regularization event introduced in the next section; we recommend $c = 1.1$. The options in (14) are valid either in finite or large samples under the conditions stated below. They are also supported by the finite-sample experiments reported in Section C. We recommend using the exact option over the asymptotic option, because by construction the former is better tailored to the given sample size $n$ and design matrix $X$. Nonetheless, the asymptotic option is easier to compute. Our theoretical results in Section 3 show that the options in (14) lead to near-oracle rates of convergence.
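
As an illustration of how the exact option in (14) might be computed by simulation, the following sketch draws Gaussian errors, evaluates $\Lambda_\Phi$ for each draw given a fixed design, and takes the empirical $(1 - \alpha)$-quantile. The function names, the number of replications, the seeds and the Gaussian design are all assumptions made for this example; the paper states only that (12) can be computed by simulation.

```python
import numpy as np
from statistics import NormalDist


def simulated_quantile(X, alpha=0.05, reps=2000, seed=0):
    """Simulate the (1 - alpha)-quantile of Lambda_Phi given X, as in (12)."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((reps, n))          # each row is a draw of (xi_1, ..., xi_n)
    numer = np.abs(xi @ X).max(axis=1)           # n * ||E_n(x xi)||_inf
    denom = np.sqrt((xi ** 2).mean(axis=1))      # {E_n(xi^2)}^{1/2}
    return float(np.quantile(numer / denom, 1.0 - alpha))


def asymptotic_quantile(n, p, alpha=0.05):
    """Lambda(1 - alpha) = n^{1/2} Phi^{-1}(1 - alpha / (2p)), as in (13)."""
    return n ** 0.5 * NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))


rng = np.random.default_rng(42)
n, p, c = 100, 50, 1.1                           # illustrative values (assumed)
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))              # normalization (4)

lam_exact = c * simulated_quantile(X)
lam_asymptotic = c * asymptotic_quantile(n, p)
```

The two penalty levels are typically close; the exact option adapts to the correlation structure of the given $X$, while the asymptotic option needs no simulation.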

For the asymptotic results, we shall impose the following condition:

Condition G. We have that $\log^2(p/\alpha)\log(1/\alpha) = o(n)$ and $p/\alpha \to \infty$ as $n \to \infty$.

The following lemma shows that the exact and asymptotic options in (14) implement the regularization event $\lambda \ge c\Lambda$ in the Gaussian case with the exact or asymptotic probability $1 - \alpha$ respectively. The lemma also bounds the magnitude of the penalty level for the exact option, which will be useful for stating bounds on the estimation error. We assume throughout the paper that $0 < \alpha < 1$ is bounded away from 1, but we allow $\alpha$ to approach 0 as $n$ grows.

Lemma 1. Suppose that $F_0 = \Phi$. (i) The exact option in (14) implements $\lambda \ge c\Lambda$ with probability at least $1 - \alpha$. (ii) Assume that $p/\alpha > 8$. For any $1 < \ell < \{n/\log(1/\alpha)\}^{1/2}$, the asymptotic option in (14) implements $\lambda \ge c\Lambda$ with probability at least

$$1 - \alpha\tau, \quad \tau = \left\{1 + \frac{1}{\log(p/\alpha)}\right\}\frac{\exp[2\log(2p/\alpha)\,\ell\{\log(1/\alpha)/n\}^{1/2}]}{1 - \ell\{\log(1/\alpha)/n\}^{1/2}} + \alpha^{\ell^2/4 - 1},$$

where, under Condition G, we have $\tau = 1 + o(1)$ by setting $\ell \to \infty$ such that $\ell = o[n^{1/2}/\{\log(p/\alpha)\log^{1/2}(1/\alpha)\}]$ as $n \to \infty$. (iii) Assume that $p/\alpha > 8$ and $n > 4\log(2/\alpha)$. Then

$$\Lambda_\Phi(1 - \alpha \mid X) \le \nu\Lambda(1 - \alpha) \le \nu\{2n\log(2p/\alpha)\}^{1/2}, \quad \nu = \frac{\{1 + 2/\log(2p/\alpha)\}^{1/2}}{1 - 2\{\log(2/\alpha)/n\}^{1/2}},$$

where under Condition G, $\nu = 1 + o(1)$ as $n \to \infty$.

In the non-normal case, $\lambda$ can be any of

$$\begin{aligned} \lambda &= c\Lambda_F(1 - \alpha \mid X), \\ \lambda &= c\max_{F \in \mathcal{F}}\Lambda_F(1 - \alpha \mid X), \\ \lambda &= c\Lambda(1 - \alpha) = c n^{1/2}\Phi^{-1}(1 - \alpha/2p), \end{aligned} \tag{15}$$

which we call the exact, semi-exact, and asymptotic options, respectively. We set the confidence level $1 - \alpha$ and the constant $c > 1$ as before. The exact option is applicable when $F_0 = F$, as for example in the previous normal case. The semi-exact option is applicable when $F_0$ is a member of some family $\mathcal{F}$ or whenever the family $\mathcal{F}$ gives a more conservative penalty level. We also assume that $\mathcal{F}$ in (15) is either finite or, more generally, that the maximum in (15) is well defined. For example, in applications where the regression errors $\epsilon_i$ are thought to have a potentially wide range of tail behavior, it is useful to set $\mathcal{F} = \{t(4), t(8), t(\infty)\}$, where $t(k)$ denotes the Student distribution with $k$ degrees of freedom. As stated previously, we can compute the quantiles $\Lambda_F(1 - \alpha \mid X)$ by simulation. Therefore, we can implement the exact option easily, and if $\mathcal{F}$ is not too large, we can also implement the semi-exact option easily. Finally, the asymptotic option is applicable when $F_0$ and design $X$ satisfy Condition M and has the advantage of being trivial to compute.
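
A possible implementation of the semi-exact option for the family $\{t(4), t(8), t(\infty)\}$ is sketched below. The helper name, the replication count, the seed and the Gaussian design are assumptions made for illustration. Since $\Lambda_F$ is self-normalized, the Student draws do not need to be rescaled to unit variance.

```python
import numpy as np


def quantile_given_X(X, draw, alpha=0.05, reps=2000):
    """(1 - alpha)-quantile of Lambda_F | X, where draw(size) samples errors from F."""
    n = X.shape[0]
    xi = draw((reps, n))
    lam = np.abs(xi @ X).max(axis=1) / np.sqrt((xi ** 2).mean(axis=1))
    return float(np.quantile(lam, 1.0 - alpha))


rng = np.random.default_rng(7)
n, p, c = 100, 50, 1.1                       # illustrative values (assumed)
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))          # normalization (4)

# Candidate error laws; t(inf) is the standard normal distribution.
family = {
    "t(4)": lambda size: rng.standard_t(4, size=size),
    "t(8)": lambda size: rng.standard_t(8, size=size),
    "t(inf)": lambda size: rng.standard_normal(size),
}
quantiles = {name: quantile_given_X(X, draw) for name, draw in family.items()}
lam_semi_exact = c * max(quantiles.values())
```

Taking the maximum over the family makes the penalty valid for every member at the cost of being slightly conservative for the lighter-tailed ones.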

For the asymptotic results in the non-normal case, we impose the following moment conditions.

Condition M. There exists a finite constant $q > 2$ such that the law $F_0$ is an element of the family $\mathcal{F}$ such that $\sup_{n \ge 1}\sup_{F \in \mathcal{F}} E_F(|\epsilon|^q) < \infty$; the design $X$ obeys $\sup_{n \ge 1,\, 1 \le j \le p} E_n(|x_j|^q) < \infty$.

We also have to restrict the growth of $p$ relative to $n$, and we also assume that $\alpha$ is either bounded away from zero or approaches zero not too rapidly. See also the Supplementary Material for an alternative condition.

Condition R. As $n \to \infty$, $p \le \alpha n^{\eta(q-2)/2}/2$ for some constant $0 < \eta < 1$, and $\alpha^{-1} = o[n^{\{(q/2-1) \wedge (q/4)\} \vee (q/2-2)}/(\log n)^{q/2}]$, where $q > 2$ is defined in Condition M.

The following lemma shows that the options (15) implement the regularization event $\lambda \ge c\Lambda$ in the non-Gaussian case with exact or asymptotic probability $1 - \alpha$. In particular, Conditions R and M, through relations (32) and (34), imply that for any fixed $w > 0$,

$$\mathrm{pr}\{|E_n(\epsilon^2) - 1| > w\} = o(\alpha), \quad n \to \infty. \tag{16}$$

The lemma also bounds the magnitude of the penalty level $\lambda$ for the exact and semi-exact options, which is useful for stating bounds on the estimation error in Section 3.

Lemma 2. (i) The exact option in (15) implements $\lambda \ge c\Lambda$ with probability at least $1 - \alpha$, if $F_0 = F$. (ii) The semi-exact option in (15) implements $\lambda \ge c\Lambda$ with probability at least $1 - \alpha$, if either $F_0 \in \mathcal{F}$ or $\Lambda_{F_0}(1 - \alpha \mid X) \le \Lambda_F(1 - \alpha \mid X)$ for some $F \in \mathcal{F}$. Suppose further that Conditions M and R hold. Then, (iii) the asymptotic option in (15) implements $\lambda \ge c\Lambda$ with probability at least $1 - \alpha - o(\alpha)$, and (iv) the magnitude of the penalty level of the exact and semi-exact options in (15) satisfies the inequality

$$\max_{F \in \mathcal{F}}\Lambda_F(1 - \alpha \mid X) \le \Lambda(1 - \alpha)\{1 + o(1)\} \le \{2n\log(2p/\alpha)\}^{1/2}\{1 + o(1)\}, \quad n \to \infty.$$

Thus all of the asymptotic conclusions reached in Lemma 1 about the penalty level in the Gaussian case continue to hold in the non-Gaussian case, albeit under more restricted conditions on the growth of $p$ relative to $n$. The growth condition depends on the number of bounded moments $q$ of regressors and the error terms: the higher $q$ is, the more rapidly $p$ can grow with $n$. We emphasize that Conditions M and R are only one possible set of sufficient conditions that guarantees the Gaussian-like conclusions of Lemma 2. We derived them using the moderate deviation theory of [22]. For example, in the Supplementary Material, we provide an alternative condition, based on the use of the self-normalized moderate deviation theory of [9], which results in a much weaker growth condition on $p$ in relation to $n$, but requires much stronger conditions on the moments of regressors.

3. Finite-sample and asymptotic bounds on the estimation error

3.1. Conditions on the Gram matrix. We shall state convergence rates for $\hat\delta = \hat\beta - \beta_0$ in the Euclidean norm $\|\delta\|_2 = (\delta'\delta)^{1/2}$ and also in the prediction norm

$$\|\delta\|_{2,n} = [E_n\{(x'\delta)^2\}]^{1/2} = \{\delta' E_n(xx')\delta\}^{1/2}.$$

The latter norm directly depends on the Gram matrix $E_n(xx')$. The choice of penalty level described in Section 2 ensures the regularization event $\lambda \ge c\Lambda$, with probability $1 - \alpha$ or with probability approaching $1 - \alpha$. This event will in turn imply another regularization event, namely that $\hat\delta$ belongs to the restricted set $\Delta_{\bar c}$, where

$$\Delta_{\bar c} = \{\delta \in \mathbb{R}^p : \|\delta_{T^c}\|_1 \le \bar c\|\delta_T\|_1,\ \delta \neq 0\}, \quad \bar c = \frac{c + 1}{c - 1}.$$

Accordingly, we will state the bounds on estimation errors $\|\hat\delta\|_{2,n}$ and $\|\hat\delta\|_2$ in terms of the following restricted eigenvalues of the Gram matrix $E_n(xx')$:

$$\kappa_{\bar c} = \min_{\delta \in \Delta_{\bar c}} \frac{s^{1/2}\|\delta\|_{2,n}}{\|\delta_T\|_1}, \quad \tilde\kappa_{\bar c} = \min_{\delta \in \Delta_{\bar c}} \frac{\|\delta\|_{2,n}}{\|\delta\|_2}. \tag{17}$$

These restricted eigenvalues can depend on $n$ and $T$, but we suppress the dependence in our notation.

In making simplified asymptotic statements, such as those appearing in Section 1, we invoke the following condition on the restricted eigenvalues:

Condition RE. There exist finite constants $n_0 > 0$ and $\kappa > 0$, such that the restricted eigenvalues obey $\kappa_{\bar c} \ge \kappa$ and $\tilde\kappa_{\bar c} \ge \kappa$ for all $n > n_0$.

The restricted eigenvalues (17) are simply variants of the restricted eigenvalues introduced in [4]. Even though the minimal eigenvalue of the Gram matrix $E_n(xx')$ is zero whenever $p > n$, [4] show that its restricted eigenvalues can be bounded away from zero, and they and others provide sufficient primitive conditions that cover many fixed and random designs of interest, which allow for reasonably general, though not arbitrary, forms of correlation between regressors. This makes conditions on restricted eigenvalues useful for many applications. Consequently, we take the restricted eigenvalues as primitive quantities and Condition RE as primitive. The restricted eigenvalues are tightly tailored to the $\ell_1$-penalized estimation problem. Indeed, $\kappa_{\bar c}$ is the modulus of continuity between the estimation norm and the penalty-related term computed over the restricted set, containing the deviation of the estimator from the true value; and $\tilde\kappa_{\bar c}$ is the modulus of continuity between the estimation norm and the Euclidean norm over this set.

It is useful to recall at least one simple sufficient condition for bounded restricted eigenvalues. If for $m = s\log n$, the $m$-sparse eigenvalues of the Gram matrix $E_n(xx')$ are bounded away from zero and from above for all $n > n'$, i.e.,

$$0 < k \le \min_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|_2^2} \le \max_{\|\delta_{T^c}\|_0 \le m,\ \delta \neq 0} \frac{\|\delta\|_{2,n}^2}{\|\delta\|_2^2} \le k' < \infty, \tag{18}$$

for some positive finite constants $k$, $k'$, and $n'$, then Condition RE holds once $n$ is sufficiently large. In words, (18) only requires the eigenvalues of certain small $m \times m$ submatrices of the large $p \times p$ Gram matrix to be bounded from above and below. The sufficiency of (18) for Condition RE follows from [4], and many sufficient conditions for (18) are provided by [4], [28], [13], and [20].
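
The contrast between the singular full Gram matrix and its well-behaved small submatrices can be seen in a toy computation. The sketch below is a simplification: it scans all $m \times m$ principal submatrices by brute force rather than the supports of the form used in (18), and the sizes, seed and Gaussian design are assumptions chosen so that the enumeration stays small.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(3)
n, p, m = 8, 12, 2                     # illustrative sizes with p > n (assumed)
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))    # normalization (4)
gram = X.T @ X / n                     # Gram matrix E_n(x x')

# The full p x p Gram matrix has rank at most n < p, so it is singular.
min_eig_full = np.linalg.eigvalsh(gram)[0]

# Extreme eigenvalues over all m x m principal submatrices.
sparse_eigs = [
    np.linalg.eigvalsh(gram[np.ix_(idx, idx)])
    for idx in combinations(range(p), m)
]
min_sparse = min(e[0] for e in sparse_eigs)
max_sparse = max(e[-1] for e in sparse_eigs)
```

Here `min_eig_full` is zero up to rounding, while `min_sparse` stays strictly positive, which is the phenomenon that conditions of the type (18) exploit.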

3.2. Finite-sample and asymptotic bounds on estimation error. We now present the main result of this paper. Recall that we do not assume that the noise is sub-Gaussian or that $\sigma$ is known.

Theorem 1. Consider the model described in (1)-(4). Let $c > 1$, $\bar c = (c + 1)/(c - 1)$, and suppose that $\lambda$ obeys the growth restriction $\lambda s^{1/2} \le n\kappa_{\bar c}\rho$, for some $\rho < 1$. If $\lambda \ge c\Lambda$, then

$$\|\hat\beta - \beta_0\|_{2,n} \le \frac{2(1 + 1/c)}{1 - \rho^2}\{\hat Q(\beta_0)\}^{1/2}\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}.$$

In particular, if $\lambda \ge c\Lambda$ with probability at least $1 - \alpha$, and $E_n(\epsilon^2) \le \omega^2$ with probability at least $1 - \gamma$, then with probability at least $1 - \alpha - \gamma$,

$$\tilde\kappa_{\bar c}\|\hat\beta - \beta_0\|_2 \le \|\hat\beta - \beta_0\|_{2,n} \le \frac{2(1 + 1/c)}{1 - \rho^2}\,\sigma\omega\,\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}.$$

This result provides a finite-sample bound for $\hat\delta$ that is similar to that for the lasso estimator with known $\sigma$, and this result leads to the same rates of convergence as in the case of lasso. It is important to note some differences. First, for a given value of the penalty level $\lambda$, the statistical structure of square-root lasso is different from that of lasso, and so our proof of Theorem 1 is also different. Second, in the proof we have to invoke the additional growth restriction, $\lambda s^{1/2} < n\kappa_{\bar c}$, which is not present in the lasso analysis that treats $\sigma$ as known. We may think of this restriction as the price of not knowing $\sigma$ in our framework. However, this additional condition is very mild and holds asymptotically under typical conditions if $(s/n)\log(p/\alpha) \to 0$, as the corollaries below indicate, and it is independent of $\sigma$. In comparison, for the lasso estimator, if we treat $\sigma$ as unknown and attempt to estimate it consistently using a pilot lasso, which uses an upper bound $\bar\sigma \ge \sigma$ instead of $\sigma$, a similar growth condition $(\bar\sigma/\sigma)(s/n)\log(p/\alpha) \to 0$ would have to be imposed, but this condition depends on $\sigma$ and is more restrictive than our growth condition when $\bar\sigma/\sigma$ is large.

Theorem 1 implies the following bounds when combined with Lemma 1, Lemma 2, and the concentration property (16).

Corollary 1. Consider the model described in (1)-(4). Suppose further that $F_0 = \Phi$, $\lambda$ is chosen according to the exact option in (14), $p/\alpha > 8$, and $n > 4\log(2/\alpha)$. Let $c > 1$, $\bar c = (c + 1)/(c - 1)$, $\nu = \{1 + 2/\log(2p/\alpha)\}^{1/2}/[1 - 2\{\log(2/\alpha)/n\}^{1/2}]$, and for any $\ell$ such that $1 < \ell < \{n/\log(1/\alpha)\}^{1/2}$, set $\omega^2 = 1 + \ell\{\log(1/\alpha)/n\}^{1/2} + \ell^2\log(1/\alpha)/(2n)$ and $\gamma = \alpha^{\ell^2/4}$. If $s\log p$ is relatively small as compared to $n$, namely $c\nu\{2s\log(2p/\alpha)\}^{1/2} \le n^{1/2}\kappa_{\bar c}\rho$ for some $\rho < 1$, then with probability at least $1 - \alpha - \gamma$,

$$\tilde\kappa_{\bar c}\|\hat\beta - \beta_0\|_2 \le \|\hat\beta - \beta_0\|_{2,n} \le \sigma\left\{\frac{2s\log(2p/\alpha)}{n}\right\}^{1/2}\frac{2(1 + c)\nu\omega}{\kappa_{\bar c}(1 - \rho^2)}.$$

Corollary 2. Consider the model described in (1)-(4) and suppose that $F_0 = \Phi$, Conditions RE and G hold, and $(s/n)\log(p/\alpha) \to 0$, as $n \to \infty$. Let $\lambda$ be specified according to either the exact or asymptotic option in (14). There is an $o(1)$ term such that with probability at least $1 - \alpha - o(\alpha)$,

$$\kappa\|\hat\beta - \beta_0\|_2 \le \|\hat\beta - \beta_0\|_{2,n} \le \sigma\left\{\frac{2s\log(2p/\alpha)}{n}\right\}^{1/2}\frac{2(1 + c)}{\kappa\{1 - o(1)\}}.$$

Corollary 3. Consider the model described in (1)-(4). Suppose that Conditions RE, M, and R hold, and $(s/n)\log(p/\alpha) \to 0$ as $n \to \infty$. Let $\lambda$ be specified according to the asymptotic, exact, or semi-exact option in (15). There is an $o(1)$ term such that with probability at least $1 - \alpha - o(\alpha)$

$$\kappa\|\hat\beta - \beta_0\|_2 \le \|\hat\beta - \beta_0\|_{2,n} \le \sigma\left\{\frac{2s\log(2p/\alpha)}{n}\right\}^{1/2}\frac{2(1 + c)}{\kappa\{1 - o(1)\}}.$$

As in Lemma 2, in order to achieve Gaussian-like asymptotic conclusions in the non-Gaussian case, we impose stronger restrictions on the growth of $p$ relative to $n$.

4. Computational properties of square-root lasso 

The second main result of this paper is to formulate the square-root lasso as a conic programming problem, with constraints given by a second-order cone, also informally known as the ice-cream cone. This allows us to implement the estimator using efficient algorithmic methods, such as interior-point methods, which provide polynomial-time bounds on computational time [16, 17], and modern first-order methods that have been recently extended to handle very large conic programming problems [14, 15, 11, 1]. Before describing the details, it is useful to recall that a conic programming problem takes the form $\min_u c'u$ subject to $Au = b$ and $u \in C$, where $C$ is a cone. Conic programming has a tractable dual form $\max_w b'w$ subject to $w'A + s = c$ and $s \in C^*$, where $C^* = \{s : s'u \ge 0 \text{ for all } u \in C\}$ is the dual cone of $C$. A particularly important, highly tractable class of problems arises when $C$ is the ice-cream cone, $C = Q^{n+1} = \{(v, t) \in \mathbb{R}^n \times \mathbb{R} : t \ge \|v\|\}$, which is self-dual, $C = C^*$.

The square-root lasso optimization problem is precisely a conic programming problem with second-order conic constraints. Indeed, we can reformulate (8) as follows:

$$\min_{t, v, \beta^+, \beta^-}\ \frac{t}{n^{1/2}} + \frac{\lambda}{n}\sum_{j=1}^p (\beta_j^+ + \beta_j^-)\ :\ v_i = y_i - x_i'\beta^+ + x_i'\beta^-,\ i = 1, \ldots, n,\ (v, t) \in Q^{n+1},\ \beta^+ \in \mathbb{R}^p_+,\ \beta^- \in \mathbb{R}^p_+. \tag{19}$$

Furthermore, we can show that this problem admits the following strongly dual problem:

$$\max_{a \in \mathbb{R}^n}\ \frac{1}{n}\sum_{i=1}^n y_i a_i\ :\ \Big|\sum_{i=1}^n x_{ij} a_i/n\Big| \le \lambda/n,\ j = 1, \ldots, p,\ \|a\| \le n^{1/2}. \tag{20}$$

Recall that strong duality holds between a primal and its dual problem if their optimal values are the same, i.e., there is no duality gap. This is typically an assumption needed for interior-point methods and first-order methods to work. From a statistical perspective, this dual problem maximizes the sample correlation of the score variable $a_i$ with the outcome variables $y_i$ subject to the constraint that the score is approximately uncorrelated with the covariates $x_{ij}$. The optimal scores $\hat a_i$ equal the residuals $y_i - x_i'\hat\beta$, for all $i = 1, \ldots, n$, up to a renormalization factor; they play a key role in deriving sparsity bounds on $\hat\beta$. We formalize the preceding discussion in the following theorem.

Theorem 2. The square-root lasso problem (8) is equivalent to the conic programming problem (19), which admits the strongly dual problem (20). Moreover, if the solution $\hat\beta$ to the problem (8) satisfies $Y \neq X\hat\beta$, the solution $\hat\beta^+$, $\hat\beta^-$, $\hat v = (\hat v_1, \ldots, \hat v_n)$ to (19), and the solution $\hat a$ to (20) are related via $\hat\beta = \hat\beta^+ - \hat\beta^-$, $\hat v_i = y_i - x_i'\hat\beta$ $(i = 1, \ldots, n)$, and $\hat a = n^{1/2}\hat v/\|\hat v\|$.

The conic formulation and the strong duality demonstrated in Theorem 2 allow us to employ both the interior-point and first-order methods for conic programs to compute the square-root lasso. We have implemented both of these methods, as well as a coordinatewise method, for the square-root lasso and made the code available through the authors' webpages. The square-root lasso runs at least as fast as the corresponding implementations of these methods for the lasso, for instance, the SDPT3 implementation of the interior-point method [24], and the TFOCS implementation of first-order methods by Becker, Candes and Grant described in [2]. We report the exact running times in the Supplementary Material.
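
The primal-dual relation in Theorem 2 can be checked on a small simulated example. The sketch below is not the authors' code: it swaps in a simple coordinate descent scheme that alternates between updating the scale $\{\hat Q(\beta)\}^{1/2}$ and soft-thresholding each coefficient, using the identity $\{\hat Q\}^{1/2} = \min_{s > 0}\{\hat Q/(2s) + s/2\}$. The function name, data sizes, seed, tolerance and the use of the upper bound in (13) as the penalty level are all assumptions made for illustration.

```python
import numpy as np


def sqrt_lasso(X, y, lam, max_sweeps=5000, tol=1e-12):
    """Minimize {Q(beta)}^{1/2} + (lam / n) * ||beta||_1 by coordinate descent.

    For a fixed scale s = {Q(beta)}^{1/2} the problem is a lasso with
    soft-threshold level s * lam / n. Assumes normalization (4) and y != X beta.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y.copy()
    for _ in range(max_sweeps):
        scale = np.sqrt(np.mean(resid ** 2))       # current {Q(beta)}^{1/2}
        thresh = scale * lam / n
        max_change = 0.0
        for j in range(p):
            xj = X[:, j]
            rho = xj @ resid / n + beta[j]         # uses E_n(x_j^2) = 1
            new = np.sign(rho) * max(abs(rho) - thresh, 0.0)
            if new != beta[j]:
                resid -= xj * (new - beta[j])      # keep resid = y - X beta
                max_change = max(max_change, abs(new - beta[j]))
                beta[j] = new
        if max_change < tol:
            break
    return beta


rng = np.random.default_rng(0)
n, p = 100, 50                                     # illustrative sizes (assumed)
X = rng.standard_normal((n, p))
X /= np.sqrt((X ** 2).mean(axis=0))                # normalization (4)
beta0 = np.zeros(p)
beta0[:3] = 1.0                                    # hypothetical sparse truth
y = X @ beta0 + rng.standard_normal(n)

lam = 1.1 * np.sqrt(2 * n * np.log(2 * p / 0.05))  # c times the upper bound in (13)
beta_hat = sqrt_lasso(X, y, lam)

v_hat = y - X @ beta_hat                           # primal residuals
a_hat = np.sqrt(n) * v_hat / np.linalg.norm(v_hat) # dual solution from Theorem 2
primal = np.sqrt(np.mean(v_hat ** 2)) + lam / n * np.abs(beta_hat).sum()
dual = y @ a_hat / n
dual_slack = lam / n - np.abs(X.T @ a_hat / n).max()
```

At convergence the primal value of (8) and the dual value of (20) coincide up to numerical tolerance, and `dual_slack` is nonnegative up to the same tolerance, so the rescaled residuals are feasible for (20).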



5. Empirical performance of square-root lasso relative to lasso 

In this section we use Monte Carlo experiments to assess the finite-sample performance of (i) the infeasible lasso with known $\sigma$, which is unknown outside the experiments, (ii) the post infeasible lasso, which applies ordinary least squares to the model selected by the infeasible lasso, (iii) the square-root lasso with unknown $\sigma$, and (iv) the post square-root lasso, which applies ordinary least squares to the model selected by the square-root lasso.

We set the penalty level for the infeasible lasso and the square-root lasso according to the asymptotic options (6) and (9) respectively, with $1 - \alpha = 0.95$ and $c = 1.1$. We have also performed experiments where we set the penalty levels according to the exact option. The results are similar, so we do not report them separately.

We use the linear regression model stated in the introduction as a data-generating process, with either standard normal or $t(4)$ errors: (a) $\epsilon_i \sim N(0, 1)$, (b) $\epsilon_i \sim t(4)/2^{1/2}$, so that $E(\epsilon_i^2) = 1$ in either case. We set the true parameter value as $\beta_0 = (1, 1, 1, 1, 1, 0, \ldots, 0)'$, and vary $\sigma$ between 0.25 and 3. The number of regressors is $p = 500$, the sample size is $n = 100$, and we used 1000 simulations for each design. We generate regressors as $x_i \sim N(0, \Sigma)$ with the Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$. We use as benchmark the performance of the oracle estimator with known true support of $\beta_0$, which is unknown outside the experiment.
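
One draw from this data-generating process could be produced as in the following sketch. The function name, its arguments and the seed handling are hypothetical, and rescaling the columns to satisfy normalization (4) after drawing is an assumption of this example rather than a step stated in the text.

```python
import numpy as np


def simulate_design(n=100, p=500, sigma=1.0, errors="normal", seed=0):
    """Draw one data set from the Monte Carlo design described above."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = 0.5 ** np.abs(idx[:, None] - idx[None, :])   # Toeplitz correlations
    chol = np.linalg.cholesky(Sigma)
    X = rng.standard_normal((n, p)) @ chol.T             # rows are N(0, Sigma)
    X /= np.sqrt((X ** 2).mean(axis=0))                  # normalization (4), assumed
    if errors == "normal":
        eps = rng.standard_normal(n)
    elif errors == "t4":
        eps = rng.standard_t(4, size=n) / np.sqrt(2.0)   # t(4) has variance 2
    else:
        raise ValueError("errors must be 'normal' or 't4'")
    beta0 = np.zeros(p)
    beta0[:5] = 1.0                                      # five significant regressors
    y = X @ beta0 + sigma * eps
    return X, y, beta0, Sigma


X, y, beta0, Sigma = simulate_design(sigma=0.25, errors="t4", seed=1)
```

Repeating the call over a grid of `sigma` values and seeds would reproduce the structure of the experiment, with the estimators themselves supplied separately.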

We present the results of computational experiments for designs (a) and (b) in Figures 1 and 2. For each model, Figure 1 shows the relative average empirical risk with respect to the oracle estimator $\beta^*$, $E(\|\tilde\beta - \beta_0\|_{2,n})/E(\|\beta^* - \beta_0\|_{2,n})$, and Figure 2 shows the average number of regressors missed from the true model and the average number of regressors selected outside the true model, $E(|\mathrm{supp}(\beta_0) \setminus \mathrm{supp}(\tilde\beta)|)$ and $E(|\mathrm{supp}(\tilde\beta) \setminus \mathrm{supp}(\beta_0)|)$, respectively.

Figure 1 shows the empirical risk of the estimators. We see that, for a wide range of the noise level $\sigma$, the square-root lasso with unknown $\sigma$ performs comparably to the infeasible lasso with known $\sigma$. These results agree with our theoretical results, which state that the upper bounds on empirical risk for the square-root lasso asymptotically approach the analogous bounds for infeasible lasso. The finite-sample differences in empirical risk for the infeasible lasso and the square-root lasso arise primarily due to the square-root lasso having a larger bias than the infeasible lasso. This bias arises because the square-root lasso uses an effectively heavier penalty induced by $\hat Q(\hat\beta)$ in place of $\sigma^2$; indeed, in these experiments, the average values of $\{\hat Q(\hat\beta)\}^{1/2}/\sigma$ varied between 1.18 and 1.22.

Figure 1. Average relative empirical risk of infeasible lasso (solid), square-root lasso (dashes), post infeasible lasso (dots), and post square-root lasso (dot-dash), with respect to the oracle estimator, that knows the true support, as a function of the standard deviation of the noise $\sigma$.

Figure 2. Average number of regressors missed from the true support for infeasible lasso (solid) and square-root lasso (dashes), and the average number of regressors selected outside the true support for infeasible lasso (dots) and square-root lasso (dot-dash), as a function of the noise level $\sigma$.

Figure 1 shows that the post square-root lasso substantially outperforms both the infeasible lasso and the square-root lasso. Moreover, for a wide range of $\sigma$, the post square-root lasso outperforms the post infeasible lasso. The post square-root lasso is able to improve over the square-root lasso due to removal of the relatively large shrinkage bias of the square-root lasso. Moreover, the post square-root lasso is able to outperform the post infeasible lasso primarily due to its better sparsity properties, which can be observed from Figure 2. These results on the post square-root lasso agree closely with our theoretical results reported in the arXiv working paper "Pivotal Estimation of Nonparametric Functions via Square-root Lasso" by the authors, which state that the upper bounds on empirical risk for the post square-root lasso asymptotically are no larger than the analogous bounds for the square-root lasso or the infeasible lasso, and can be strictly better when the square-root lasso acts as a near-perfect model selection device. We see this happening in Figure 1, where as the noise level $\sigma$ decreases, the post square-root lasso starts to perform as well as the oracle estimator. As we see from Figure 2, this happens because as $\sigma$ decreases, the square-root lasso starts to select the true model nearly perfectly, and hence the post square-root lasso starts to become the oracle estimator with a high probability.

Let us now comment on the difference between the normal and t(4) noise
cases, i.e., between the right and left panels in Figures 5 and 2. We see that
the results for the Gaussian case carry over to the t(4) case with nearly
undetectable changes. In fact, the performance of the infeasible lasso and the
square-root lasso under t(4) errors nearly coincides with their performance
under Gaussian errors, as predicted by our theoretical results in the main text,
using moderate deviation theory, and in the Supplementary Material, using
self-normalized moderate deviation theory.

In the Supplementary Material, we provide further Monte Carlo comparisons
that include asymmetric error distributions, highly correlated designs, and
feasible lasso estimators based on the use of conservative bounds on $\sigma$
and cross validation. Let us briefly summarize the key conclusions from these
experiments. First, the presence of asymmetry in the noise distribution and of a
high correlation in the design does not change the results qualitatively.
Second, naive use of conservative bounds on $\sigma$ does not result in good
feasible lasso estimators. Third, the use of cross validation for choosing the
penalty level does produce a feasible lasso estimator performing well in terms
of empirical risk but poorly in terms of model selection. Nevertheless, even in
terms of empirical risk, the cross-validated lasso is outperformed by the post
square-root lasso. In addition, the cross-validated lasso is outperformed by the
square-root lasso with the penalty level scaled by 1/2. This is noteworthy,
since the estimators based on the square-root lasso are much cheaper
computationally. Lastly, in the 2011 arXiv working paper "Pivotal Estimation of
Nonparametric Functions via Square-Root Lasso" we provide a further analysis of
the post square-root lasso estimator and generalize the setting of the present
paper to the fully nonparametric regression model.

Supplementary Material

The online Supplementary Material contains a complementary analysis of the 
penalty choice based on moderate deviation theory for self-normalizing sums, dis- 
cussion on computational aspects of square-root lasso as compared to lasso, and 
additional Monte Carlo experiments. We also provide the omitted part of proof of 
Lemma 2, and list the inequalities used in the proofs. 

Appendix 1 

Proofs of Theorems 1 and 2. 

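Proof of Theorem 1. Step 1. We show that $\hat\delta = \hat\beta - \beta_0 \in \Delta_{\bar c}$
under the prescribed penalty level. By definition of $\hat\beta$,

\[ \{\hat Q(\hat\beta)\}^{1/2} - \{\hat Q(\beta_0)\}^{1/2} \le \frac{\lambda}{n}\|\beta_0\|_1 - \frac{\lambda}{n}\|\hat\beta\|_1 \le \frac{\lambda}{n}\left(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1\right), \tag{21} \]

where the last inequality holds because

\[ \|\beta_0\|_1 - \|\hat\beta\|_1 = \|\beta_{0T}\|_1 - \|\hat\beta_T\|_1 - \|\hat\beta_{T^c}\|_1 \le \|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1. \tag{22} \]

Also, if $\lambda \ge cn\|\tilde S\|_\infty$ then

\[ \{\hat Q(\hat\beta)\}^{1/2} - \{\hat Q(\beta_0)\}^{1/2} \ge -\|\tilde S\|_\infty\|\hat\delta\|_1 \ge -\frac{\lambda}{cn}\left(\|\hat\delta_T\|_1 + \|\hat\delta_{T^c}\|_1\right), \tag{23} \]

where the first inequality holds by convexity of $\hat Q^{1/2}$. Combining (21) with (23) we
obtain

\[ -\frac{\lambda}{cn}\left(\|\hat\delta_T\|_1 + \|\hat\delta_{T^c}\|_1\right) \le \frac{\lambda}{n}\left(\|\hat\delta_T\|_1 - \|\hat\delta_{T^c}\|_1\right), \tag{24} \]

that is

\[ \|\hat\delta_{T^c}\|_1 \le \frac{c+1}{c-1}\|\hat\delta_T\|_1 = \bar c\,\|\hat\delta_T\|_1. \tag{25} \]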

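Step 2. We derive bounds on the estimation error. We shall use the following
relations:

\[ \hat Q(\hat\beta) - \hat Q(\beta_0) = \|\hat\delta\|_{2,n}^2 - 2E_n(\sigma\epsilon x'\hat\delta), \tag{26} \]
\[ \hat Q(\hat\beta) - \hat Q(\beta_0) = \left[\{\hat Q(\hat\beta)\}^{1/2} + \{\hat Q(\beta_0)\}^{1/2}\right]\left[\{\hat Q(\hat\beta)\}^{1/2} - \{\hat Q(\beta_0)\}^{1/2}\right], \tag{27} \]
\[ 2|E_n(\sigma\epsilon x'\hat\delta)| \le 2\{\hat Q(\beta_0)\}^{1/2}\|\tilde S\|_\infty\|\hat\delta\|_1, \tag{28} \]
\[ \|\hat\delta_T\|_1 \le \frac{s^{1/2}\|\hat\delta\|_{2,n}}{\kappa_{\bar c}}, \quad \hat\delta \in \Delta_{\bar c}, \tag{29} \]

where (28) holds by the Holder inequality and (29) holds by the definition of $\kappa_{\bar c}$.
Using (21) and (26)-(29) we obtain

\[ \|\hat\delta\|_{2,n}^2 \le 2\{\hat Q(\beta_0)\}^{1/2}\|\tilde S\|_\infty\|\hat\delta\|_1 + \left[\{\hat Q(\hat\beta)\}^{1/2} + \{\hat Q(\beta_0)\}^{1/2}\right]\frac{\lambda}{n}\left(\frac{s^{1/2}\|\hat\delta\|_{2,n}}{\kappa_{\bar c}} - \|\hat\delta_{T^c}\|_1\right). \tag{30} \]

Also using (21) and (29) we obtain

\[ \{\hat Q(\hat\beta)\}^{1/2} \le \{\hat Q(\beta_0)\}^{1/2} + \frac{\lambda}{n}\left(\frac{s^{1/2}\|\hat\delta\|_{2,n}}{\kappa_{\bar c}}\right). \tag{31} \]

Combining inequalities (31) and (30), we obtain

\[ \|\hat\delta\|_{2,n}^2 \le 2\{\hat Q(\beta_0)\}^{1/2}\|\tilde S\|_\infty\|\hat\delta_T\|_1 + 2\{\hat Q(\beta_0)\}^{1/2}\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}\|\hat\delta\|_{2,n} + \left(\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}\right)^2\|\hat\delta\|_{2,n}^2, \]

and then using (29) we obtain

\[ \left\{1 - \left(\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}\right)^2\right\}\|\hat\delta\|_{2,n}^2 \le 2\left(\frac{1}{c} + 1\right)\{\hat Q(\beta_0)\}^{1/2}\frac{\lambda s^{1/2}}{n\kappa_{\bar c}}\|\hat\delta\|_{2,n}. \]

Provided that $(n\kappa_{\bar c})^{-1}\lambda s^{1/2} \le \rho < 1$ and solving the inequality above we obtain
the bound stated in the theorem. □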

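Proof of Theorem 2. The equivalence of square-root lasso problem (8) and the conic
programming problem (19) follows immediately from the definitions. To establish
the duality, for $e = (1,\ldots,1)'$, we can write (19) in matrix form as

\[ \min_{t,v,\beta^+,\beta^-} \frac{t}{n^{1/2}} + \frac{\lambda}{n}e'\beta^+ + \frac{\lambda}{n}e'\beta^- \;:\; v + X\beta^+ - X\beta^- = Y,\quad (v,t) \in Q^{n+1},\ \beta^+ \in R^p_+,\ \beta^- \in R^p_+. \]

By the conic duality theorem, this has dual

\[ \max_{a,s^t,s^v,s^+,s^-} Y'a \;:\; s^t = 1/n^{1/2},\ a + s^v = 0,\ X'a + s^+ = \lambda e/n,\ -X'a + s^- = \lambda e/n,\quad (s^v,s^t) \in Q^{n+1},\ s^+ \in R^p_+,\ s^- \in R^p_+. \]

The constraints $X'a + s^+ = \lambda e/n$ and $-X'a + s^- = \lambda e/n$ lead to $\|X'a\|_\infty \le \lambda/n$.
The conic constraint $(s^v,s^t) \in Q^{n+1}$ leads to $1/n^{1/2} = s^t \ge \|s^v\| = \|a\|$. By scaling
the variable $a$ by $n$ we obtain the stated dual problem.

Since the primal problem is strongly feasible, strong duality holds by Theorem
3.2.6 of [17]. Thus, by strong duality, we have
$n^{-1}\sum_{i=1}^n y_i\hat a_i = n^{-1/2}\|Y - X\hat\beta\| + n^{-1}\lambda\sum_{j=1}^p|\hat\beta_j|$.
Since $n^{-1}\sum_{i=1}^n x_{ij}\hat a_i\hat\beta_j = \lambda|\hat\beta_j|/n$ for every $j = 1,\ldots,p$, we have

\[ \frac{1}{n}\sum_{i=1}^n y_i\hat a_i = \frac{\|Y - X\hat\beta\|}{n^{1/2}} + \sum_{j=1}^p\frac{1}{n}\sum_{i=1}^n x_{ij}\hat a_i\hat\beta_j = \frac{\|Y - X\hat\beta\|}{n^{1/2}} + \frac{1}{n}\sum_{i=1}^n \hat a_i\sum_{j=1}^p x_{ij}\hat\beta_j. \]

Rearranging the terms we have $n^{-1}\sum_{i=1}^n (y_i - x_i'\hat\beta)\hat a_i = \|Y - X\hat\beta\|/n^{1/2}$. If
$\|Y - X\hat\beta\| > 0$, since $\|\hat a\| \le n^{1/2}$, the equality can only hold for
$\hat a = n^{1/2}(Y - X\hat\beta)/\|Y - X\hat\beta\| = (Y - X\hat\beta)/\{\hat Q(\hat\beta)\}^{1/2}$. □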
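The closing step of this proof can be checked numerically. The sketch below is illustrative only: the residual vector is a made-up example and the helper names are hypothetical. It forms $\hat a = (Y - X\hat\beta)/\{\hat Q(\hat\beta)\}^{1/2}$ from a vector of residuals and confirms that $\|\hat a\| = n^{1/2}$ and that $n^{-1}\sum_i (y_i - x_i'\hat\beta)\hat a_i = \|Y - X\hat\beta\|/n^{1/2}$.

```python
import math

def dual_certificate(residuals):
    """Return a_hat = residuals / sqrt(Q_hat), where Q_hat is the mean squared residual."""
    n = len(residuals)
    q_hat = sum(r * r for r in residuals) / n
    if q_hat == 0.0:
        # The formula in the proof requires ||Y - X beta_hat|| > 0.
        raise ValueError("residuals are all zero")
    root_q = math.sqrt(q_hat)
    return [r / root_q for r in residuals]

def euclidean_norm(v):
    """Plain Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in v))

# Hypothetical residuals Y - X beta_hat for n = 4 observations.
residuals = [3.0, 4.0, 0.0, 0.0]
a_hat = dual_certificate(residuals)

n = len(residuals)
dual_value = sum(r * a for r, a in zip(residuals, a_hat)) / n
primal_value = euclidean_norm(residuals) / math.sqrt(n)
print(euclidean_norm(a_hat), dual_value, primal_value)
```

The two printed objective values agree, which is the equality used at the end of the proof.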

Appendix 2 

Proofs of Lemmas 1 and 2. 

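Proof of Lemma 1. Statement (i) holds by definition. To show statement (ii), we
define $t_n = \Phi^{-1}(1 - \alpha/2p)$ and $0 < r_n = 2\{\log(1/\alpha)/n\}^{1/2} < 1$. It is known that
$\log(p/\alpha) < t_n^2 < 2\log(2p/\alpha)$ when $p/\alpha > 8$. Then since $Z_j = n^{1/2}E_n(x_j\epsilon) \sim N(0,1)$
for each $j$, conditional on $X$, we have by the union bound and $F_0 = \Phi$,

\[ \mathrm{pr}(c\Lambda > cn^{1/2}t_n \mid X) \le p\,\mathrm{pr}\{|Z_j| > t_n(1 - r_n) \mid X\} + \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} \le 2p\,\bar\Phi\{t_n(1 - r_n)\} + \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\}. \]

Statement (ii) follows by observing that by the Chernoff tail bound for $\chi^2(n)$, Lemma 1 in [12],
$\mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} \le \exp(-nr_n^2/4)$ and

\[ 2p\,\bar\Phi\{t_n(1 - r_n)\} \le 2p\frac{\phi\{t_n(1 - r_n)\}}{t_n(1 - r_n)} = 2p\frac{\phi(t_n)}{t_n}\frac{\exp(t_n^2r_n - \tfrac{1}{2}t_n^2r_n^2)}{1 - r_n} \le 2p\,\bar\Phi(t_n)\left(1 + \frac{1}{t_n^2}\right)\frac{\exp(t_n^2r_n - \tfrac{1}{2}t_n^2r_n^2)}{1 - r_n} \le \alpha\left(1 + \frac{1}{\log(p/\alpha)}\right)\frac{\exp\{2\log(2p/\alpha)r_n\}}{1 - r_n}, \]

where we have used the inequality $\phi(t)t/(1 + t^2) \le \bar\Phi(t) \le \phi(t)/t$ for $t > 0$.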

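For statement (iii), it is sufficient to show that $\mathrm{pr}(\Lambda_\Phi > \nu n^{1/2}t_n \mid X) \le \alpha$.
It can be seen that there exists a $\nu'$ such that $\nu' > \{1 + 2/\log(2p/\alpha)\}^{1/2}$ and
$1 - \nu'/\nu > 2\{\log(2/\alpha)/n\}^{1/2}$, so that

\[ \mathrm{pr}(\Lambda_\Phi > \nu n^{1/2}t_n \mid X) \le p\max_{1\le j\le p}\mathrm{pr}(|Z_j| > \nu't_n \mid X) + \mathrm{pr}\{E_n(\epsilon^2) < (\nu'/\nu)^2\} = 2p\,\bar\Phi(\nu't_n) + \mathrm{pr}[\{E_n(\epsilon^2)\}^{1/2} < \nu'/\nu]. \]

Proceeding as before, by the Chernoff tail bound for $\chi^2(n)$,
$\mathrm{pr}[\{E_n(\epsilon^2)\}^{1/2} < \nu'/\nu] \le \exp\{-n(1 - \nu'/\nu)^2/4\} \le \alpha/2$, and

\[ 2p\,\bar\Phi(\nu't_n) \le 2p\frac{\phi(\nu't_n)}{\nu't_n} = 2p\frac{\phi(t_n)}{t_n}\frac{\exp\{-\tfrac{1}{2}t_n^2(\nu'^2 - 1)\}}{\nu'} \le 2p\,\bar\Phi(t_n)\left(1 + \frac{1}{t_n^2}\right)\frac{\exp\{-\tfrac{1}{2}t_n^2(\nu'^2 - 1)\}}{\nu'} \]
\[ \le \alpha\left(1 + \frac{1}{\log(p/\alpha)}\right)\frac{\exp\{-\log(2p/\alpha)(\nu'^2 - 1)\}}{\nu'} \le 2\alpha\exp\{-\log(2p/\alpha)(\nu'^2 - 1)\} < \alpha/2. \]

Putting the inequalities together, we conclude that $\mathrm{pr}(\Lambda_\Phi > \nu n^{1/2}t_n \mid X) \le \alpha$.

Finally, the asymptotic result follows directly from the finite sample bounds and
noting that $p/\alpha \to \infty$ and that under the growth condition we can choose $\ell \to \infty$
so that $\ell\log(p/\alpha)\log^{1/2}(1/\alpha) = o(n^{1/2})$. □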
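The proof relies on the sandwich $\log(p/\alpha) < t_n^2 < 2\log(2p/\alpha)$ for $t_n = \Phi^{-1}(1 - \alpha/2p)$ when $p/\alpha > 8$. As a quick sanity check, and not as part of the argument, the following sketch evaluates the three quantities with the Python standard library; the function name and the example values of p and alpha are illustrative assumptions.

```python
import math
from statistics import NormalDist

def normal_quantile_bounds(p, alpha):
    """Return (log(p/alpha), t_n^2, 2*log(2p/alpha)) for t_n = Phi^{-1}(1 - alpha/(2p))."""
    t_n = NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))
    lower = math.log(p / alpha)
    upper = 2.0 * math.log(2.0 * p / alpha)
    return lower, t_n ** 2, upper

# Example values matching the simulation design (p = 500, 1 - alpha = 0.95).
lower, t_sq, upper = normal_quantile_bounds(p=500, alpha=0.05)
print(lower, t_sq, upper)
```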



Proof of Lemma 2. Statements (i) and (ii) hold by definition. To show (iii),
consider first the case of $2 < q \le 8$, and define $t_n = \Phi^{-1}(1 - \alpha/2p)$ and
$r_n = \alpha^{-2/q}n^{-\{(1-2/q)\wedge 1/2\}}\ell_n$, for some $\ell_n$ which grows to infinity but so slowly that the
condition stated below is satisfied. Then for any $F_0 = F_{0n}$ and $X = X_n$ that obey
Condition M:

\[ \mathrm{pr}(c\Lambda > cn^{1/2}t_n \mid X) \le_{(1)} p\max_{1\le j\le p}\mathrm{pr}\{|n^{1/2}E_n(x_j\epsilon)| > t_n(1 - r_n) \mid X\} + \mathrm{pr}[\{E_n(\epsilon^2)\}^{1/2} < 1 - r_n] \]
\[ \le_{(2)} p\max_{1\le j\le p}\mathrm{pr}\{|n^{1/2}E_n(x_j\epsilon)| > t_n(1 - r_n) \mid X\} + o(\alpha) \]
\[ =_{(3)} 2p\,\bar\Phi\{t_n(1 - r_n)\}\{1 + o(1)\} + o(\alpha) \]
\[ =_{(4)} 2p\frac{\phi\{t_n(1 - r_n)\}}{t_n(1 - r_n)}\{1 + o(1)\} + o(\alpha) = 2p\frac{\phi(t_n)}{t_n}\frac{\exp(t_n^2r_n - t_n^2r_n^2/2)}{1 - r_n}\{1 + o(1)\} + o(\alpha) \]
\[ =_{(5)} 2p\frac{\phi(t_n)}{t_n}\{1 + o(1)\} + o(\alpha) =_{(6)} 2p\,\bar\Phi(t_n)\{1 + o(1)\} + o(\alpha) = \alpha\{1 + o(1)\}, \]

where (1) holds by the union bound; (2) holds by the application of either Rosenthal's
inequality (Rosenthal 1970) for the case of $q > 4$ or von Bahr-Esseen's
inequalities (von Bahr & Esseen 1965) for the case of $2 < q \le 4$,

\[ \mathrm{pr}[\{E_n(\epsilon^2)\}^{1/2} < 1 - r_n] \le \mathrm{pr}\{|E_n(\epsilon^2) - 1| > r_n\} \lesssim \alpha\ell_n^{-q/2} = o(\alpha), \tag{32} \]

(4) and (6) hold by $\phi(t)/t \sim \bar\Phi(t)$ as $t \to \infty$; (5) holds by $t_n^2r_n = o(1)$, which holds if
$\log(p/\alpha)\alpha^{-2/q}n^{-\{(1-2/q)\wedge 1/2\}}\ell_n = o(1)$. Under our condition $\log(p/\alpha) = O(\log n)$,
this condition is satisfied for some slowly growing $\ell_n$ if

\[ \alpha^{-1} = o\{n^{(q/2-1)\wedge q/4}/\log^{q/2}n\}. \tag{33} \]

To verify relation (3), by Condition M and Slastnikov's theorem on moderate
deviations, see [22] and [19], we have that uniformly in $0 \le |t| \le k\log^{1/2}n$ for some
$k^2 < q - 2$, uniformly in $1 \le j \le p$ and for any $F_0 = F_{0n} \in \mathcal{F}$,
$\mathrm{pr}\{n^{1/2}|E_n(x_j\epsilon)| > t \mid X\}/\{2\bar\Phi(t)\} \to 1$, so the relation (3) holds for
$t = t_n(1 - r_n) \le \{2\log(2p/\alpha)\}^{1/2} \le \{\eta(q - 2)\log n\}^{1/2}$ for $\eta < 1$ by Condition R.
We apply Slastnikov's theorem to the variables $z_{i,n} = x_{ij}\epsilon_i$, where we allow the design $X$,
the law $F_0$, and index $j$ to be indexed by $n$. Slastnikov's theorem then applies provided
$\sup_{n,j\le p}E_n\{E_{F_0}(|z_n|^q)\} = \sup_{n,j\le p}E_n(|x_j|^q)E_{F_0}(|\epsilon|^q) < \infty$, which is implied by
our Condition M, and where we used the condition that the design is fixed, so
that $\epsilon_i$ are independent of $x_{ij}$. Thus, we obtained the moderate deviation result
uniformly in $1 \le j \le p$ and for any sequence of distributions $F_0 = F_{0n}$ and designs
$X = X_n$ that obey our Condition M.

Next suppose that $q \ge 8$. Then the same argument applies, except that now
relation (2) could also be established by using Slastnikov's theorem on moderate
deviations. In this case redefine $r_n = k\{\log n/n\}^{1/2}$; then, for some constant
$k < \{(q/2) - 2\}^{1/2}$ we have

\[ \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} \le \mathrm{pr}\{|E_n(\epsilon^2) - 1| > r_n\} \lesssim n^{-k^2/2}, \tag{34} \]

so the relation (2) holds if

\[ 1/\alpha = o(n^{k^2/2}). \tag{35} \]

This applies whenever $q > 4$, and this results in weaker requirements on $\alpha$ if $q \ge 8$.
The relation (5) then follows if $t_n^2r_n = o(1)$, which is easily satisfied for the new $r_n$,
and the result follows.






Combining conditions in (33) and (35) to give the weakest restrictions on the 
growth of a~^, we obtain the growth conditions stated in the lemma. 

To show statement (iv) of the lemma, it suffices to show that for any v' > 1, and 
F £ J^, pr(Aj? > v'n^^^tn \ X) = o{a), which follows analogously to the proof of 
statement (iii); we relegate the details to the Supplementary Material. □ 
Supplementary Appendix for "Square-root lasso: pivotal
recovery of sparse signals via conic programming"



Abstract. In this appendix we gather additional theoretical and
computational results for "Square-root lasso: pivotal recovery of
sparse signals via conic programming." We include a complementary
analysis of the penalty choice based on moderate deviation
theory for self-normalizing sums. We provide a discussion on
computational aspects of square-root lasso as compared to lasso. We
carry out additional Monte Carlo experiments. We also provide the
omitted part of the proof of Lemma 2, and list the inequalities used in
the proofs.



Appendix A. Additional Theoretical Results 

In this section we derive additional results using moderate deviation theory for
self-normalizing sums to bound the penalty level. These results are complementary
to the results given in the main text since the conditions required here are neither
implied by nor imply the conditions there. These conditions require stronger moment
assumptions but in exchange they result in weaker growth requirements on p in relation to
n.

Recall the definition of the choices of penalty levels

\[ \text{exact: } \lambda = c\Lambda_{F_0}(1 - \alpha \mid X), \qquad \text{asymptotic: } \lambda = cn^{1/2}\Phi^{-1}(1 - \alpha/2p), \tag{36} \]

where $\Lambda_{F_0}(1 - \alpha \mid X)$ = $(1 - \alpha)$-quantile of $n\|E_n(x\epsilon)\|_\infty/\{E_n(\epsilon^2)\}^{1/2}$. We will
make use of the following condition.

Condition SN. There is a $q > 4$ such that the noise obeys $\sup_{n\ge 1}E_{F_0}(|\epsilon|^q) < \infty$,
and the design $X$ obeys $\sup_{n\ge 1}\max_{1\le i\le n}\|x_i\|_\infty < \infty$. Moreover, we also assume
$\log(p/\alpha)\alpha^{-2/q}n^{-1/2}\log^{1/2}(n \vee p/\alpha) = o(1)$ and $E_n(x_j^2) = 1$ $(j = 1,\ldots,p)$.

Lemma 3. Suppose that condition SN holds and $n \to \infty$. Then, (1) the asymptotic
option in (36) implements $\lambda > c\Lambda$ with probability at least $1 - \alpha\{1 + o(1)\}$, and (2)

\[ \Lambda_{F_0}(1 - \alpha \mid X) \le \{1 + o(1)\}n^{1/2}\Phi^{-1}(1 - \alpha/2p). \]

This lemma in combination with Theorem 1 of the main text implies the following
result:

Corollary 4. Consider the model described in the main text. Suppose that
Conditions RE and SN hold, and $(s/n)\log(p/\alpha) \to 0$ as $n \to \infty$. Let $\lambda$ be specified
according to the asymptotic or exact option in (36). There is an o(1) term such
that with probability at least $1 - \alpha\{1 + o(1)\}$

.\\R R\\ <\\R R\\ f 2slog(2p/a) )^/^ 2(1 + c) 

I n J k{1-o(1)} 
Appendix B. Additional Computational Results

B.1. Overview of Additional Computational Results. In the main text we
formulated the square-root lasso as a convex conic programming problem. This fact 
allows us to use conic programming methods to compute the square-root lasso esti- 
mator. In this section we provide further details on these methods, specifically on (i) 
the first-order methods, (ii) the interior-point methods, and (iii) the componentwise 
search methods, as specifically adapted to solving conic programming problems. We 
shall also compare the adaptation of these methods to square-root lasso with the 
respective adaptation of these methods to lasso. 

B.2. Computational Times. Our main message here is that the average running 
times for solving lasso and the square-root lasso are comparable in practical prob- 
lems. We document this in Table 1, where we record the average computational 
times, in seconds, of the three computation methods mentioned above. The de- 
sign for computational experiments is the same as in the main text. In fact, we 
see that the square-root lasso is often slightly easier to compute than the lasso. 
The table also reinforces the typical behavior of the three principal computational 
methods. As the size of the optimization problem increases, the running time for 
an interior-point method grows faster than that for a first-order method. We also 
see, perhaps more surprisingly, that a simple componentwise method is particularly 
effective, and this might be due to a high sparsity of the solutions in our examples. 
An important remark here is that we did not attempt to compare rigorously across 
different computational methods to isolate the best ones, since these methods have 
different initialization and stopping criteria and the results could be affected by 
that. Rather our focus here is comparing the performance of each computational 
method as applied to lasso and the square-root lasso. This is an easier compari- 
son problem, since given a computational method, the initialization and stopping 
criteria are similar for two problems. 



n = 100, p = 500       Componentwise   First Order   Interior Point
lasso                  0.2173          10.99         2.545
square-root lasso      0.3268          7.345         1.645

n = 200, p = 1000      Componentwise   First Order   Interior Point
lasso                  0.6115          19.84         14.20
square-root lasso      0.6448          19.96         8.291

n = 400, p = 2000      Componentwise   First Order   Interior Point
lasso                  2.625           84.12         108.9
square-root lasso      2.687           77.65         62.86

Table 1. Average computational times in seconds. We use the same design as
in the main text, with s = 5 and $\sigma$ = 1; we averaged the computational
times over 100 simulations.

B.3. Details on Computational Methods. Below we discuss in more detail the
applications of these methods for the lasso and the square-root lasso. The similari- 
ties between the lasso and the square-root lasso formulations derived below provide 
a theoretical justification for the similar computational performance. 

Interior-point methods. Interior-point method (ipm) solvers typically focus
on solving conic programming problems in standard form:

\[ \min_w c'w \;:\; Aw = b,\ w \in K, \tag{37} \]

where $K$ is a cone. The main difficulty of the problem arises because the conic
constraint is binding at the optimal solution. To overcome the difficulty, ipms
regularize the objective function with a barrier function so that the optimal solution
of the regularized problem naturally lies in the interior of the cone. By steadily
scaling down the barrier function, an ipm creates a sequence of solutions that
converges to the solution of the original problem (37).

In order to formulate the optimization problem associated with the lasso
estimator as a conic programming problem (37), specifically, associated with the
second-order cone $Q^{n+1} = \{(v,t) \in R^{n+1} : t \ge \|v\|\}$, we let $\beta = \beta^+ - \beta^-$ for
$\beta^+ \ge 0$ and $\beta^- \ge 0$. For any vector $v \in R^n$ and scalar $t \ge 0$, we have that $v'v \le t$
is equivalent to $\|(v, (t - 1)/2)\|_2 \le (t + 1)/2$. The latter can be formulated as a
second-order cone constraint. Thus, the lasso problem can be cast as

\[ \min_{t,\beta^+,\beta^-,a_1,a_2,v} \frac{t}{n} + \frac{\lambda}{n}\sum_{j=1}^p(\beta_j^+ + \beta_j^-) \;:\; v = Y - X\beta^+ + X\beta^-,\ t = -1 + 2a_1,\ t = 1 + 2a_2,\quad (v,a_2,a_1) \in Q^{n+2},\ t \ge 0,\ \beta^+ \in R^p_+,\ \beta^- \in R^p_+. \]

Recall from the main text that the square-root lasso optimization problem can be
cast similarly, but without auxiliary variables $a_1, a_2$:

\[ \min_{t,\beta^+,\beta^-,v} \frac{t}{n^{1/2}} + \frac{\lambda}{n}\sum_{j=1}^p(\beta_j^+ + \beta_j^-) \;:\; v = Y - X\beta^+ + X\beta^-,\quad (v,t) \in Q^{n+1},\ \beta^+ \in R^p_+,\ \beta^- \in R^p_+. \]

These conic formulations allow us to make several different computational methods
directly applicable to compute these estimators.
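The reformulation of the quadratic constraint $v'v \le t$ as the second-order cone constraint $\|(v, (t - 1)/2)\|_2 \le (t + 1)/2$ can be checked directly, since squaring both sides gives $v'v + (t - 1)^2/4 \le (t + 1)^2/4$, that is $v'v \le t$. The sketch below is a small numerical illustration of that equivalence; the function names and the test vector are illustrative assumptions, not part of any solver.

```python
import math

def in_second_order_cone(z, t):
    """Membership test for the cone {(z, t) : t >= ||z||_2}."""
    return t >= math.sqrt(sum(x * x for x in z))

def quadratic_constraint_as_cone(v, t):
    """Check v'v <= t through the equivalent cone constraint ||(v, (t-1)/2)|| <= (t+1)/2."""
    lifted = list(v) + [(t - 1.0) / 2.0]
    return in_second_order_cone(lifted, (t + 1.0) / 2.0)

v = [1.0, 2.0]                                # v'v = 5
print(quadratic_constraint_as_cone(v, 6.0))   # t above v'v
print(quadratic_constraint_as_cone(v, 4.0))   # t below v'v
```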

First-order methods. Modern first-order methods focus on structured convex
problems of the form:

\[ \min_w f\{A(w) + b\} + h(w) \quad\text{or}\quad \min_w h(w) \;:\; A(w) + b \in K, \]

where $f$ is a smooth function and $h$ is a structured function that is possibly
non-differentiable or has extended values. However, it allows for an efficient proximal
function to be solved; see "Templates for Convex Cone Problems with Applications
to Sparse Signal Recovery", arXiv working paper 1009.2065 by Becker, Candès and
Grant. By combining projections and subgradient information, these methods
construct a sequence of iterates with strong theoretical guarantees. Recently these
methods have been specialized for conic problems, which include the lasso and the
square-root lasso problems.
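Lasso is cast as

\[ \min_w f\{A(w) + b\} + h(w) \]

where $f(\cdot) = \|\cdot\|^2/n$, $h(\cdot) = (\lambda/n)\|\cdot\|_1$, $A = X$, and $b = -Y$. The projection
required to be solved on every iteration for a given current point $\beta^k$ is

\[ \beta(\beta^k) = \arg\min_\beta\; -2E_n\{x(y - x'\beta^k)\}'\beta + \frac{1}{2}\mu\|\beta - \beta^k\|^2 + \frac{\lambda}{n}\|\beta\|_1, \]

where $\mu$ is a smoothing parameter. It follows that the minimization in $\beta$ above is
separable and can be solved by soft-thresholding as

\[ \beta_j(\beta^k) = \mathrm{sign}\left[\beta_j^k + \frac{2E_n\{x_j(y - x'\beta^k)\}}{\mu}\right]\max\left\{\left|\beta_j^k + \frac{2E_n\{x_j(y - x'\beta^k)\}}{\mu}\right| - \frac{\lambda}{n\mu},\ 0\right\}. \]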

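For the square-root lasso the "conic form" is

\[ \min_w h(w) \;:\; A(w) + b \in K. \]

Letting $Q^{n+1} = \{(z,t) \in R^n \times R : t \ge \|z\|\}$ and $h(w) = f(\beta,t) = t/n^{1/2} + (\lambda/n)\|\beta\|_1$
we have that

\[ \min_{\beta,t} \frac{t}{n^{1/2}} + \frac{\lambda}{n}\|\beta\|_1 \;:\; A(\beta,t) + b \in Q^{n+1}, \]

where $b = (-Y', 0)'$ and $A(\beta,t) \mapsto (\beta'X', t)'$.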

In the associated dual problem, the dual variable z € M" is constrained to be 
llzjl ^ (the corresponding dual variable associated with t is set to to 

obtain a finite dual value). Thus we obtain



„„ - i + - 

||z|Kl/ni/2 ;9 71 2 



max inf + - (3''f - z'(F - X/3). 



Given iterates (3'', z'^, as in the case of lasso, the minimization in /3 is separable and 
can be solved by soft-thresholding as 

= sign{/3}- + (X'zVA*),}max{|/3^^ -I- (X'zVm).I - 0} . 

The dual projection accounts for the constraint ||z|| ^ l/n^^^ and solves 

z(/3^z'^) -arg min ^\\z - Zk\\^ + (Y - X fi'')' z 

||2|Kl/ni/2 Ztk 

which yields 

It is useful to note that, in the TFOCS package, the following command line
computes the square-root lasso estimator:
opts = tfocs_SCD;
[ beta, out ] = tfocs_SCD( prox_l1(lambda/n), { X, -Y }, proj_l2(1/sqrt(n)), 1e-6 );
where n denotes the sample size, lambda the penalty choice, X denotes the n by p
design matrix, and Y a vector with n observations of the response variable. The
square-root lasso estimator is stored in the vector beta.
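The dual step described above constrains the dual variable to the Euclidean ball of radius $1/n^{1/2}$. Under the assumption that the dual update amounts to a Euclidean projection onto that ball, the projection itself is a simple rescaling. The sketch below illustrates it in plain Python; the function name is hypothetical and the snippet is independent of the TFOCS package.

```python
import math

def project_l2_ball(z, radius):
    """Euclidean projection of z onto {z : ||z||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm <= radius:
        return list(z)            # already feasible, nothing to do
    scale = radius / norm         # shrink onto the boundary of the ball
    return [scale * v for v in z]

n = 4
z = [3.0, 4.0, 0.0, 0.0]
z_proj = project_l2_ball(z, 1.0 / math.sqrt(n))
print(z_proj)
```

After the projection the norm of `z_proj` equals `1/sqrt(n)` up to rounding, while points already inside the ball are returned unchanged.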

Componentwise Search. A common approach to solving unconstrained multivariate
optimization problems is to do componentwise minimization, looping over
components until convergence is achieved. This is particularly attractive in cases
where the minimization over a single component can be done very efficiently.
Consider the following lasso optimization problem:

\[ \min_\beta E_n\{(y - x'\beta)^2\} + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|. \]

Under standard normalization assumptions we would have $\gamma_j = 1$ and $E_n(x_j^2) = 1$
$(j = 1,\ldots,p)$. The main ingredient of the componentwise search for lasso is the
rule that sets optimally the value of $\beta_j$ given fixed values of the remaining
variables:

For a current point $\beta$, let $\beta_{-j} = (\beta_1, \beta_2, \ldots, \beta_{j-1}, 0, \beta_{j+1}, \ldots, \beta_p)'$:

If $2E_n\{x_j(y - x'\beta_{-j})\} > \lambda\gamma_j/n$, the optimal choice for $\beta_j$ is

\[ \beta_j = \left[2E_n\{x_j(y - x'\beta_{-j})\} - \lambda\gamma_j/n\right]/\{2E_n(x_j^2)\}. \]

If $2E_n\{x_j(y - x'\beta_{-j})\} < -\lambda\gamma_j/n$, the optimal choice for $\beta_j$ is

\[ \beta_j = \left[2E_n\{x_j(y - x'\beta_{-j})\} + \lambda\gamma_j/n\right]/\{2E_n(x_j^2)\}. \]

If $2|E_n\{x_j(y - x'\beta_{-j})\}| \le \lambda\gamma_j/n$, then $\beta_j = 0$.

This simple method is particularly attractive when the optimal solution is sparse,
which is typically the case of interest under choices of penalty levels that dominate
the noise like $\lambda > cn\|S\|_\infty$.
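The componentwise rule can be sketched in a few lines. The code below is a minimal illustration for the objective $E_n\{(y - x'\beta)^2\} + (\lambda/n)\sum_j\gamma_j|\beta_j|$, with each update obtained from the first-order condition in $\beta_j$; the function name, the fixed number of sweeps, and the toy data are assumptions made for the example, and columns of the design are assumed to be nonzero.

```python
def lasso_coordinate_descent(X, y, lam, gamma=None, n_sweeps=100):
    """Componentwise minimization of E_n{(y - x'b)^2} + (lam/n) * sum_j gamma_j |b_j|.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    """
    n, p = len(X), len(X[0])
    gamma = gamma if gamma is not None else [1.0] * p
    beta = [0.0] * p
    resid = list(y)                                   # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            a = sum(c * c for c in col) / n           # E_n(x_j^2), assumed > 0
            # E_n{x_j (y - x'beta_{-j})}: add back coordinate j's contribution.
            c_j = sum(col[i] * (resid[i] + col[i] * beta[j]) for i in range(n)) / n
            kappa = lam * gamma[j] / n
            if 2.0 * c_j > kappa:
                new = (2.0 * c_j - kappa) / (2.0 * a)
            elif 2.0 * c_j < -kappa:
                new = (2.0 * c_j + kappa) / (2.0 * a)
            else:
                new = 0.0
            diff = new - beta[j]
            if diff != 0.0:
                resid = [resid[i] - col[i] * diff for i in range(n)]
            beta[j] = new
    return beta

# Toy single-regressor example with E_n(x^2) = 1.
X = [[1.0], [-1.0], [1.0], [-1.0]]
y = [2.0, -2.0, 2.0, -2.0]
print(lasso_coordinate_descent(X, y, lam=4.0))
```

In this toy example the objective reduces to $(2 - b)^2 + b$ for $b > 0$, whose minimizer is 1.5.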

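Despite the additional square-root, which creates a non-separable criterion
function, it turns out that the componentwise minimization for the square-root lasso
also has a closed form solution. Consider the following optimization problem:

\[ \min_\beta \left[E_n\{(y - x'\beta)^2\}\right]^{1/2} + \frac{\lambda}{n}\sum_{j=1}^p\gamma_j|\beta_j|. \]

As before, under standard normalization assumptions we would have $\gamma_j = 1$ and
$E_n(x_j^2) = 1$ for $j = 1,\ldots,p$.

The main ingredient of the componentwise search for square-root lasso is the rule
that sets optimally the value of $\beta_j$ given fixed values of the remaining variables:

If $E_n\{x_j(y - x'\beta_{-j})\} > (\lambda/n)\gamma_j\{\hat Q(\beta_{-j})\}^{1/2}$, set

\[ \beta_j = \frac{E_n\{x_j(y - x'\beta_{-j})\}}{E_n(x_j^2)} - \frac{\lambda\gamma_j}{E_n(x_j^2)}\,\frac{\left[\hat Q(\beta_{-j}) - \{E_n(x_jy - x_jx'\beta_{-j})\}^2\{E_n(x_j^2)\}^{-1}\right]^{1/2}}{\left[n^2 - \{\lambda^2\gamma_j^2/E_n(x_j^2)\}\right]^{1/2}}. \]

If $E_n\{x_j(y - x'\beta_{-j})\} < -(\lambda/n)\gamma_j\{\hat Q(\beta_{-j})\}^{1/2}$, set

\[ \beta_j = \frac{E_n\{x_j(y - x'\beta_{-j})\}}{E_n(x_j^2)} + \frac{\lambda\gamma_j}{E_n(x_j^2)}\,\frac{\left[\hat Q(\beta_{-j}) - \{E_n(x_jy - x_jx'\beta_{-j})\}^2\{E_n(x_j^2)\}^{-1}\right]^{1/2}}{\left[n^2 - \{\lambda^2\gamma_j^2/E_n(x_j^2)\}\right]^{1/2}}. \]

If $|E_n\{x_j(y - x'\beta_{-j})\}| \le (\lambda/n)\gamma_j\{\hat Q(\beta_{-j})\}^{1/2}$, set $\beta_j = 0$.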
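The closed-form componentwise update for the square-root lasso can be sketched as follows. The formula coded here is derived from the first-order condition of $[E_n\{(r - x_jb)^2\}]^{1/2} + (\lambda/n)\gamma_j|b|$ in $b$, where $r = y - x'\beta_{-j}$ is the partial residual; the function name and the toy numbers are assumptions made for illustration.

```python
import math

def sqrt_lasso_coordinate_update(x_col, partial_resid, lam, gamma_j=1.0):
    """Minimize sqrt(E_n{(r - x_j b)^2}) + (lam/n) * gamma_j * |b| over the scalar b.

    x_col holds x_ij for i = 1..n; partial_resid holds r_i = y_i - x_i'beta_{-j}.
    """
    n = len(x_col)
    a = sum(x * x for x in x_col) / n                          # E_n(x_j^2)
    c = sum(x * r for x, r in zip(x_col, partial_resid)) / n   # E_n{x_j r}
    q0 = sum(r * r for r in partial_resid) / n                 # Q(beta_{-j})
    kappa = lam * gamma_j / n
    if abs(c) <= kappa * math.sqrt(q0):
        return 0.0                                             # penalty dominates
    # Outside the zero region Cauchy-Schwarz gives kappa^2 < a, so the
    # denominator below is positive; max() only guards against rounding.
    unexplained = max(q0 - c * c / a, 0.0)
    shrink = (lam * gamma_j / a) * math.sqrt(unexplained / (n * n - (lam * gamma_j) ** 2 / a))
    return c / a - shrink if c > 0 else c / a + shrink

# Toy example: E_n(x^2) = 1, E_n(x r) = 2, Q = 5.
x_col = [1.0, -1.0, 1.0, -1.0]
resid = [3.0, -1.0, 3.0, -1.0]
print(sqrt_lasso_coordinate_update(x_col, resid, lam=2.0))
```

In the toy example the update equals $2 - 1/3^{1/2}$, and one can verify that it satisfies the first-order condition $E_n\{x_j(r - x_jb)\} = (\lambda/n)\{Q(b)\}^{1/2}$.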

Appendix C. Additional Monte Carlo Results 

C.l. Overview of Additional Monte Carlo Results. In this section we provide 
more extensive Monte Carlo experiments to assess the finite sample performance 
of the proposed square-root lasso estimator. First we compare the performances 
of lasso and square-root lasso for different distributions of the noise and different 
designs. Second we compare square-root lasso with several feasible versions of lasso
that estimate the unknown parameter $\sigma$.

C.2. Detailed performance comparison of lasso and square-root lasso.
Regarding the parameters for lasso and square-root lasso, we set the penalty level
according to the asymptotic options defined in the main text:

\[ \text{lasso penalty: } \sigma c\,2n^{1/2}\Phi^{-1}(1 - \alpha/2p), \qquad \text{square-root lasso penalty: } c\,n^{1/2}\Phi^{-1}(1 - \alpha/2p), \]

respectively, with $1 - \alpha = 0.95$ and $c = 1.1$. As noted in the main text, experiments
with the penalty levels according to the exact option led to similar behavior.
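For concreteness, the asymptotic penalty levels above can be evaluated with the Python standard library. This is a small sketch, with hypothetical function names, that plugs in the simulation settings $n = 100$, $p = 500$, $1 - \alpha = 0.95$ and $c = 1.1$; note that only the lasso penalty requires the noise level $\sigma$.

```python
import math
from statistics import NormalDist

def sqrt_lasso_penalty(n, p, alpha=0.05, c=1.1):
    """Asymptotic square-root lasso penalty c * n^{1/2} * Phi^{-1}(1 - alpha/(2p))."""
    return c * math.sqrt(n) * NormalDist().inv_cdf(1.0 - alpha / (2.0 * p))

def infeasible_lasso_penalty(n, p, sigma, alpha=0.05, c=1.1):
    """Asymptotic lasso penalty sigma * c * 2 * n^{1/2} * Phi^{-1}(1 - alpha/(2p))."""
    return 2.0 * sigma * sqrt_lasso_penalty(n, p, alpha, c)

lam_sqrt = sqrt_lasso_penalty(n=100, p=500)
lam_lasso = infeasible_lasso_penalty(n=100, p=500, sigma=1.0)
print(lam_sqrt, lam_lasso)
```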

We use the linear regression model stated in the introduction of the main text
as a data-generating process, with either standard normal, t(4), or asymmetric
exponential errors: (a) $\epsilon_i \sim N(0,1)$, (b) $\epsilon_i \sim t(4)/2^{1/2}$, or (c) $\epsilon_i \sim \exp(1) - 1$,
so that $E(\epsilon_i^2) = 1$ in each case. We set the true parameter value as
$\beta_0 = (1,1,1,1,1,0,\ldots,0)'$, and we vary the parameter $\sigma$ between 0.25 and 3. The number
of regressors is p = 500, the sample size is n = 100, and we used 100 simulations
for each design. We generate regressors as $x_i \sim N(0,\Sigma)$. We consider two design
options for $\Sigma$: the Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$ and the equicorrelated
correlation matrix $\Sigma_{jk} = (1/2)$.
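The two design options are easy to write down explicitly. The sketch below builds both matrices as nested lists; the function names are hypothetical, and the unit diagonal of the equicorrelated matrix is an assumption that follows from it being a correlation matrix.

```python
def toeplitz_correlation(p, rho=0.5):
    """Toeplitz correlation matrix with entries rho^{|j - k|}."""
    return [[rho ** abs(j - k) for k in range(p)] for j in range(p)]

def equicorrelated_correlation(p, rho=0.5):
    """Equicorrelated matrix: rho off the diagonal, 1 on the diagonal (assumed)."""
    return [[1.0 if j == k else rho for k in range(p)] for j in range(p)]

for row in toeplitz_correlation(4):
    print(row)
for row in equicorrelated_correlation(4):
    print(row)
```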

The results of computational experiments for designs a), b) and c) in Figures 3
and 3 illustrate the theoretical results obtained in the paper. That is,
the performance of the non-Gaussian cases b) and c) is very similar to the Gaussian
case. Moreover, as expected, higher correlation between covariates translates into
larger empirical risk.

The performance of square-root lasso and post square-root lasso is relatively
close to the performance of lasso and post lasso that know $\sigma$. These results are in
close agreement with our theoretical results, which state that the upper bounds on
empirical risk for square-root lasso asymptotically approach the analogous bounds
for infeasible lasso.

C.3. Comparison with feasible versions of lasso. Next we focus on the Toeplitz
design above to compare many traditional estimators related to lasso. More
specifically we consider the following estimators: (1) oracle estimator, which is ols applied
to the true minimal model (which is unknown outside the experiment), (2) infeasible
lasso with known $\sigma$ (which is unknown outside the experiment), (3) post lasso,
which applies ols to the model selected by infeasible lasso, (4) square-root lasso,
(5) post square-root lasso, which applies least squares to the model selected by
square-root lasso, (6) 1-step feasible lasso, which is lasso with an estimate of $\sigma$
given by the conservative upper bound $\hat\sigma = [E_n\{(y - \bar y)^2\}]^{1/2}$ where $\bar y = E_n(y)$, (7)
post 1-step lasso, which applies least squares to the model selected by 1-step lasso,
(8) 2-step lasso, which is lasso with an estimate of $\sigma$ given by the 1-step lasso
estimator $\hat\beta$, namely $\hat\sigma = \{\hat Q(\hat\beta)\}^{1/2}$, (9) post 2-step lasso, which applies least squares
to the model selected by 2-step lasso, (10) cv-lasso, which is lasso with an estimate
of $\lambda$ given by K-fold cross validation, (11) post cv-lasso, which applies ols to the
model selected by cv-lasso, (12) square-root lasso (1/2), which uses the penalty
of square-root lasso multiplied by 1/2, (13) post square-root lasso (1/2), which
applies least squares to the model selected by square-root lasso (1/2). We generate
regressors as $x_i \sim N(0,\Sigma)$ with the Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$.
We focus our evaluation of the performance of an estimator $\tilde\beta$ on the relative
average empirical risk with respect to the oracle estimator $\beta^*$,
$E(\|\tilde\beta - \beta_0\|_{2,n})/E(\|\beta^* - \beta_0\|_{2,n})$.
We present the results comparing square-root lasso to lasso where the penalty
parameter $\lambda$ is chosen based on a K-fold cross-validation procedure. We report the
experiments for designs a), b) and c) in Figure 5. The first observation is that, as
indicated by the theoretical results of the paper, the performance of the non-Gaussian
cases b) and c) is very similar to the Gaussian case, so we focus on the latter.

We observe that cv-lasso does improve upon square-root lasso (and infeasible
lasso as well) with respect to empirical risk. The cross-validation procedure
selects a smaller penalty level, which reduces the bias. However, cv-lasso is uniformly
dominated by a square-root lasso method with penalty scaled by 1/2. Note that the
computational burden of cross-validation is substantial since one needs to solve several
different lasso instances. Importantly, cv-lasso does not perform well for purposes
of model selection. This can be seen from the fact that post cv-lasso performs
substantially worse than cv-lasso. Figure 5 also illustrates that square-root lasso
performs substantially better than cv-lasso for purposes of model selection since
post square-root lasso thoroughly dominates all other feasible methods considered.

Figure 6 compares other feasible lasso methods that are not as computationally
intensive as cross-validation. The estimator with the best performance for all noise
levels considered was the post square-root lasso, reflecting the good model
selection properties of the square-root lasso. The simple 1-step lasso with a conservative
estimate of $\sigma$ does very poorly. The 2-step lasso does better, but it is still
dominated by square-root lasso. The post 1-step lasso and the post 2-step lasso are also
dominated by the post square-root lasso on all noise levels tested.



Appendix D. Proofs of Additional Theoretical Results 

Proof of Lemma 3. Part 1. Let t„ = $^^(1 — a/2p) and for some Wn — ^ oo slowly 
enough let u„ ~ Wna^'^^'^n^^^^ iog^^^{n'V p) < 1/2 for n large enough. Thus, 

pr (A > n^/'^tn I X) pr maxi-j^^p {e„{^C^')02 > (1 - "«)^n I ^ 

r ( E (x^e^) } ^^'^ 

-l-pr maxisjjscp | E^{e^) / >^ + Un\X 

since (1 4- w„)(l — Un) < 1. To bound the first term above, by Condition SN, we 
have that for n large enough i„ + 1 ^ 7i^/^/[£„ maxisjj^p{i?„(|a;j p)i?(|eip)}^/^] 
where £n ^ oo slowly enough. Thus, by the union bound and Lemma 7 



pr 



^ p max pr omi/-) > (1 " Un)t„ X 

<2p${(l-u„)t4(l + jl-) 

where $t_n^2u_n = o(1)$ under condition SN, and the last inequality follows from
standard bounds on $\bar\Phi = 1 - \Phi$, and calculations similar to those in the proof of Lemma
1 of the main text. Moreover,



pr 



max < ^;(f ^')l V^'>l + ^„|X 



s; pr <^ max lE^ixje^)] > 1 + u„ \ X \ + 



2 



+pT{En{e^) < 1 - K/2)} 



since 1/(1 + it„) ^ 1 — u„ + ^ 1 — (u„/2) since it„ ^ 1/2. It follows that 

w{E„{e^) < 1 - iun/2)} s; pr{|£;„(e2) - 1| > uj2} 

< awn"^^ log-^/^in \Jp)= o{a) 

by the choice of Un and the application of Rosenthal's inequality. 

1/2 

Moreover, for n sufficiently large, letting ti = T2 = ajwn , we have 



^ 4 



1/2 /s^,i..l<J.\2/9 



by condition SN since we have g > 4, maxi^i^„ ll^^'illcxj is uniformly bounded above, 
log(2p/ri) < log(pVn), and Wn — )• oo. Thus, applying Lemma 8, noting the relation 
above, we have 

pr (maxi^jscp E^ix^je'^) > 1 + Un \ X) s$ pr ^ max^ |£;„{a;|(e^ - 1)}| > u„ | 

S% n + T2 = o(q!). 

Part 2. Let t„ = $^^(1 — ce/2p) and for some w„ — >■ oo slowly enough let 
Un = 'Wna~'^^'^n~^^'^ \og^^'^{n\/ p) < 1/2 for n large enough. Thus, 



pr 



{a> {l + un)il + l/t„)n'/h„\x\ ^pr max -Pj^^^i^^ > (1 + l/t„)f„ I X 



+ 



+pr 



2 2x >, 1/2 



> l + U„\ X 



By the same argument as in part 1 we have 



pr 



max ■ 



>l + Un\X 



o{a). 



To bound the first term above, by Condition SN, we use that for n large enough 
(1 + l/tn)tn + 1 n^^^yenmiiici^j^p{Eni\xij\^)E{\ei\^)y/^ where ^„ -> oo slowly 
enough. Thus, by the union bound and Lemma 7



pr 



max ' ^ > 1 + i„ 



X 



^ p max pr 



n^'^\Er,{x,e)\ 



{i?„(x2e2)}i/ 



> i„ + 1 I X 



< 2p$(i„ + 1) 1 



A 



< Q;exp(-t„ - 1/2)- 



tr 



1 



1 



'tn + l {' ' (t„ + l)2 

= a{l + o(l)}cxp(-t„-l/2), 
where the last inequality follows from standard bounds on $\bar\Phi$ and
calculations similar to those in the proof of Lemma 1 of the main text.
Therefore, for n sufficiently large,

\[ \mathrm{pr}\{\Lambda > (1 + u_n)(1 + 1/t_n)n^{1/2}t_n \mid X\} < \alpha \]

so that $\Lambda_{F_0}(1 - \alpha \mid X) \le (1 + u_n)(1 + 1/t_n)n^{1/2}t_n = \{1 + o(1)\}n^{1/2}t_n$. □



Appendix E. Omitted Proofs from the Main Text 

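E.1. Omitted Part of Proof of Lemma 1. Claim in the proof of Lemma 1: For
independent random variables $\epsilon_i \sim N(0,1)$ $(i = 1,\ldots,n)$ and any $0 < r_n < 1$, we
have

\[ \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} \le \exp(-nr_n^2/4). \]

It follows from

\[ \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} = \mathrm{pr}\{E_n(\epsilon^2) < 1 - 2r_n + r_n^2\} \le \mathrm{pr}\{E_n(\epsilon^2) < 1 - r_n\} = \mathrm{pr}\{E_n(\epsilon^2 - 1) < -r_n\} \]
\[ = \mathrm{pr}\left\{\sum_{i=1}^n(\epsilon_i^2 - 1) < -nr_n\right\} = \mathrm{pr}\left\{\sum_{i=1}^n a_i(\epsilon_i^2 - 1) < -2|a|_2\,n^{1/2}r_n/2\right\} \]

where we have $a_i = 1$ $(i = 1,\ldots,n)$ so that $|a|_2 = n^{1/2}$. Applying the second
inequality of Lemma 1 of [12] for $x^{1/2} = n^{1/2}r_n/2$, we have

\[ \mathrm{pr}\{E_n(\epsilon^2) < (1 - r_n)^2\} \le \exp(-nr_n^2/4). \]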



E.2. Omitted Part of Proof of Lemma 2. To show statement (iv) of Lemma
2, it suffices to show that for any $\nu' > 1$, $\mathrm{pr}(c\Lambda > c\nu' n^{1/2}t_n \mid X) = o(\alpha)$, which
follows analogously to the proof of statement (iii).






ALEXANDRE BELLONI, VICTOR CHERNOZHUKOV AND LIE WANG 



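Indeed, for some constants $1 < \nu < \nu'$,

$$
\begin{aligned}
\mathrm{pr}(c\Lambda > c\nu' n^{1/2}t_n \mid X)
&\le_{(1)} p \max_{1\le j\le p} \mathrm{pr}\{|n^{1/2}E_n(x_j\epsilon)| > t_n\nu \mid X\} + \mathrm{pr}[\{E_n(\epsilon^2)\}^{1/2} < \nu/\nu'] \\
&\le p \max_{1\le j\le p} \mathrm{pr}\{|n^{1/2}E_n(x_j\epsilon)| > t_n\nu \mid X\} + \mathrm{pr}\{E_n(\epsilon^2) < (\nu/\nu')^2\} \\
&\le_{(2)} p \max_{1\le j\le p} \mathrm{pr}\{|n^{1/2}E_n(x_j\epsilon)| > t_n\nu \mid X\} + o(\alpha) \\
&=_{(3)} 2p\,\bar\Phi(t_n\nu)\{1 + o(1)\} + o(\alpha) = o(\alpha),
\end{aligned}
$$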



where (1) holds by the union bound; (2) holds by the application of the Rosenthal
and von Bahr-Esseen inequalities:

$$\mathrm{pr}\{E_n(\epsilon^2) < (\nu/\nu')^2\} = o(\alpha)$$

provided that $1/\alpha = o\{n^{(q/4) \wedge (q/2 - 1)}\}$.



To verify relation (3), by Condition M and Slastnikov-Rubin-Sethuraman's theorem
on moderate deviations, we have that uniformly in $0 \le |t| \le k\log^{1/2} n$ for some
$k^2 < q - 2$, uniformly in $1 \le j \le p$ and for any $F = F_n \in \mathcal{F}$, $\mathrm{pr}\{n^{1/2}|E_n(x_j\epsilon)| >
t\}/\{2\bar\Phi(t)\} \to 1$, so the relation (3) holds for $t = t_n\nu \le \{2\log(2p/\alpha)\}^{1/2}\nu
\le \nu\{\eta(q - 2)\log n\}^{1/2}$ for $\eta < 1$ by assumption, provided $\nu$ is set sufficiently close to
1 so that $\nu^2\eta < 1$.

When $q > 4$, for large n we can also bound $\mathrm{pr}\{|E_n(\epsilon^2 - 1)| > (\nu/\nu')^2\}$ by
$\mathrm{pr}\{|E_n(\epsilon^2 - 1)| > r_n\}$ where $r_n = k\{\log n/n\}^{1/2}$, $k^2 < q/2 - 2$, and invoking
Slastnikov's theorem as previously, which gives $\mathrm{pr}\{|E_n(\epsilon^2 - 1)| > (\nu/\nu')^2\}
= o(\alpha)$ if $1/\alpha = O(n^{k^2/2}) = o(n^{q/4 - 1})$.



Taking the best conditions on l/a gives the restriction: 



a 




(38) 



Appendix F. Tools Used 
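F.1. Rosenthal and Von Bahr-Esseen Inequalities.

Lemma 4. Let $X_1,\ldots,X_n$ be independent zero-mean random variables. Then for
$r \ge 2$

$$E\Big(\Big|\sum_{i=1}^n X_i\Big|^r\Big) \le C(r)\max\Big[\sum_{i=1}^n E(|X_i|^r),\ \Big\{\sum_{i=1}^n E(X_i^2)\Big\}^{r/2}\Big].$$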



This is due to [18]. 

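Corollary 5. Let $r \ge 2$, and consider the case of identically distributed zero-mean
variables $X_i$ with $E(X_i^2) = 1$ and $E(|X_i|^r)$ bounded by $C$. Then for any $\ell_n \to \infty$

$$\mathrm{pr}\Big(\frac{|\sum_{i=1}^n X_i|}{n} > \ell_n n^{-1/2}\Big) \le \frac{2C(r)C}{\ell_n^r} \to 0.$$

To verify the corollary, we use Rosenthal's inequality $E(|\sum_{i=1}^n X_i|^r) \le C(r)Cn^{r/2}$,
and the result follows by Markov inequality,

$$\mathrm{pr}\Big(\frac{|\sum_{i=1}^n X_i|}{n} > \ell_n n^{-1/2}\Big) \le \frac{C(r)Cn^{r/2}}{\ell_n^r n^{r/2}} \le \frac{C(r)C}{\ell_n^r}.$$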

Lemma 5. Let $X_1,\ldots,X_n$ be independent zero-mean random variables. Then for
$1 \le r \le 2$

$$E\Big(\Big|\sum_{i=1}^n X_i\Big|^r\Big) \le (2 - n^{-1})\sum_{k=1}^n E(|X_k|^r).$$



This result is due to [26]. 



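Corollary 6. Let $r \in [1,2]$, and consider the case of identically distributed zero-mean
variables $X_i$ with $E(|X_i|^r)$ bounded by $C$. Then for any $\ell_n \to \infty$

$$\mathrm{pr}\Big(\frac{|\sum_{i=1}^n X_i|}{n} > \ell_n n^{-(1 - 1/r)}\Big) \le \frac{2C}{\ell_n^r} \to 0.$$

The corollary follows by Markov and von Bahr-Esseen's inequalities.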

F.2. Moderate Deviations for Sums. Let $X_{ni}$, $i = 1,\ldots,k_n$; $n \ge 1$ be a double
sequence of row-wise independent random variables with $E(X_{ni}) = 0$, $E(X_{ni}^2) < \infty$,
$i = 1,\ldots,k_n$; $n \ge 1$, and $B_n^2 = \sum_{i=1}^{k_n} E(X_{ni}^2) \to \infty$ as $n \to \infty$. Let

$$F_n(x) = \mathrm{pr}\Big(\sum_{i=1}^{k_n} X_{ni} < xB_n\Big).$$

The following result is due to [21]. 
Lemma 6. If for sufficiently large n and some positive constant c,

$$\sum_{i=1}^{k_n} E\{|X_{ni}|^{2+c^2}\rho(|X_{ni}|)\log^{-(1+c^2)/2}(3 + |X_{ni}|)\} \le g(B_n)B_n^2,$$

where $\rho(t)$ is a slowly varying function monotonically growing to infinity and $g(t) =
o\{\rho(t)\}$ as $t \to \infty$, then

$$1 - F_n(x) \sim 1 - \Phi(x), \quad F_n(-x) \sim \Phi(-x), \quad n \to \infty,$$

uniformly in the region $0 \le x \le c\{\log B_n^2\}^{1/2}$.

The following result is due to [21] and [19]. 

Corollary 7. If $q > c^2 + 2$ and $\sum_{i=1}^{k_n} E(|X_{ni}|^q) \le KB_n^2$, then

$$1 - F_n(x) \sim 1 - \Phi(x), \quad F_n(-x) \sim \Phi(-x), \quad n \to \infty,$$

uniformly in the region $0 \le x \le c\{\log B_n^2\}^{1/2}$.

Remark. Rubin-Sethuraman derived the corollary for $x = t\{\log B_n^2\}^{1/2}$ for fixed
t. Slastnikov's result adds uniformity and relaxes the moment assumption.
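
The content of these moderate deviation results can be illustrated in the simplest case. The Python sketch below takes $X_{ni}$ to be independent random signs, so that $B_n^2 = n$, computes the exact upper tail $\mathrm{pr}(\sum_i X_{ni} \ge x B_n)$ from the binomial distribution, and reports its ratio to $1 - \Phi(x)$. The choice of random signs, the function names and the values of n and x are assumptions made for illustration only; the example uses a weak inequality in the tail event and does not verify the moment conditions of Lemma 6 in any generality.

```python
import math


def normal_upper_tail(x):
    """Upper tail 1 - Phi(x), computed through erfc."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def rademacher_upper_tail(n, x):
    """Exact pr(S_n >= x * sqrt(n)) for S_n a sum of n independent +/-1 signs."""
    # S_n = 2B - n with B ~ Binomial(n, 1/2), so S_n >= s iff B >= (n + s) / 2.
    k_min = max(math.ceil((n + x * math.sqrt(n)) / 2.0), 0)
    count = sum(math.comb(n, k) for k in range(k_min, n + 1))
    return count / 2 ** n


def tail_ratio(n, x):
    """Ratio of the exact tail of the normalized sum to the Gaussian tail."""
    return rademacher_upper_tail(n, x) / normal_upper_tail(x)


# Illustrative values (assumption): n = 400 and a few moderate values of x.
ratios = {x: tail_ratio(400, x) for x in (0.0, 1.0, 2.0)}
for x, ratio in ratios.items():
    print(x, ratio)
```

The ratios stay near one over this range; the residual gap reflects the discreteness of the sign sums at finite n.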
F.3. Moderate Deviations for Self-Normalizing Sums. We shall be using the
following two technical results. The first follows from Theorem 7.4 in [7], which is
based on [9].

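Lemma 7. Let $X_{1,n},\ldots,X_{n,n}$ be a triangular array of independent non-identically
distributed zero-mean random variables. Suppose

$$M_n = \frac{\{n^{-1}\sum_{i=1}^n E(X_{i,n}^2)\}^{1/2}}{\{n^{-1}\sum_{i=1}^n E(|X_{i,n}|^3)\}^{1/3}} > 0$$

and that for some $\ell_n \to \infty$ we have $n^{1/6}M_n/\ell_n \ge 1$.
Then there is a universal constant A such that uniformly on $0 \le x \le n^{1/6}M_n/\ell_n - 1$,
the quantities

$$S_{n,n} = \sum_{i=1}^n X_{i,n}, \qquad V_{n,n}^2 = \sum_{i=1}^n X_{i,n}^2$$

obey

$$\Big|\frac{\mathrm{pr}(|S_{n,n}/V_{n,n}| \ge x)}{2\bar\Phi(x)} - 1\Big| \le \frac{A}{\ell_n^3}.$$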

The second follows from the proof of Lemma 10 given in the working paper 
"Pivotal Estimation of Nonparametric Functions via Square-root Lasso", arXiv 
1105.1475v2, by the authors, which is based on symmetrization arguments. 

Lemma 8. Let (i = l,...,ri,j be independent identically distributed random 
variables such that E(e^) = 1 and sup„^2 -^(kd') < co for q ^ A. Conditional on 
xi, . . . ,Xn S TR^ , with probability 1 — 4ti — 4t2 
Figure 3. Average relative empirical risk of infeasible lasso (dots),
square-root lasso (solid), post infeasible lasso (dot-dash), and post
square-root lasso (solid with circle), with respect to the oracle
estimator, that knows the true support, as a function of the standard
deviation of the noise $\sigma$. In this experiment we used Toeplitz
correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$.
Figure 4. Average relative empirical risk of infeasible lasso (dots),
square-root lasso (solid), post infeasible lasso (dot-dash), and post
square-root lasso (solid with circle), with respect to the oracle
estimator, that knows the true support, as a function of the standard
deviation of the noise $\sigma$. In this experiment we used equicorrelated
correlation matrix $\Sigma_{jk} = (1/2)$.
Comparison of square-root lasso to cross-validation choice of $\lambda$
Figure 5. Average relative empirical risk of square-root lasso
(solid), post square-root lasso (solid with circle), cv-lasso (dots),
post cv-lasso (dot-dash), and square-root lasso (1/2) (dashes), with
respect to the oracle estimator, that knows the true support, as a
function of the standard deviation of the noise $\sigma$. In this
experiment we used Toeplitz correlation matrix $\Sigma_{jk} = (1/2)^{|j-k|}$.
Figure 6. Average relative empirical risk of square-root lasso
(solid), post square-root lasso (solid with circle), 1-step lasso
(dots), post 1-step lasso (dot-dash), 2-step lasso (dashes), post
2-step lasso (dots with triangle), with respect to the oracle estimator,
that knows the true support, as a function of the standard deviation
of the noise $\sigma$. In this experiment we used Toeplitz correlation
matrix $\Sigma_{jk} = (1/2)^{|j-k|}$.



